Cascaded Machine-Learning Technique for Debris Classiﬁcation in Floor-Cleaning Robot Application

Debris detection and classification is an essential function for autonomous floor-cleaning robots. It enables floor-cleaning robots to identify and avoid hard-to-clean debris, specifically large liquid spillage debris. This paper proposes a debris-detection and classification scheme for an autonomous floor-cleaning robot using a deep Convolutional Neural Network (CNN) and Support Vector Machine (SVM) cascaded technique. The SSD (Single-Shot MultiBox Detector) MobileNet CNN architecture is used to classify the solid and liquid spill debris on the floor from the captured image. Then, an SVM model performs binary classification of liquid spillage regions based on size, which helps floor-cleaning devices identify the larger liquid spillage regions, considered hard-to-clean debris in this work. The experimental results prove that the proposed technique can efficiently detect and classify debris on the floor and achieves 95.5% classification accuracy. The cascaded approach takes approximately 71 milliseconds for the entire debris-detection and classification process, which implies that the proposed technique is suitable for deployment in real-time selective floor-cleaning applications.


Introduction
Floor-cleaning robots are widely used in food courts, hospitals, industries, and homes for picking up debris (dust, dirt, liquid spillage debris) and mopping floors. The critical challenge for floor-cleaning robots is to determine the type of debris and recognize easy-to-clean and hard-to-clean debris, specifically large liquid spillage debris. Larger debris is hard to clean for most floor-cleaning robots, so avoiding it is a better option than spreading it across the floor following a failed attempt to clean [1]. Recognizing and avoiding such larger debris is important for efficient cleaning by floor-cleaning robots. Typically, current floor-cleaning robots use piezoelectric debris sensors or optical sensors to recognize debris on the floor. However, these sensing devices can identify debris only when it appears on the sensor and cannot be used for classifying debris types such as solid and liquid [1][2][3]. Hence, cleaning devices cannot avoid hard-to-clean liquid debris regions (e.g., large liquid spillage debris) and cannot predetermine the amount of cleaning effort required to achieve better cleaning for small liquid spillage debris [1]. Machine vision-based debris recognition is an emerging technique for autonomous cleaning robots. It provides an efficient solution for recognizing debris on the floor [4][5][6]. Andersen et al. proposed computer vision-based dusty-area detection. The authors use a multivariable statistical method to recognize the dirty areas and generate a cleanness map for the floor-cleaning robot [4]. Borman et al. proposed a debris-recognition technique for floor-cleaning robots in which a spectral residual image filter is used to detect dust and dirt debris on the floor [5]. Canny edge detection and Maximally Stable Extremal Regions (MSER)-based machine vision has been proposed for floor-cleaning robots to recognize and classify dust and mud on the floor [7]. Andreas et al. 
suggested an unsupervised learning-based dust and dirt detection algorithm. The authors use a Gaussian Mixture Model (GMM) to recognize the dust spots on the floor [8]. David et al. [9] developed a computer vision-based dust detection algorithm using median filter and thresholding technique where the median filter is used to remove the background of the surface and thresholding technique is used to recognize the debris on the floor. However, these techniques have shortcomings. The schemes can identify debris on the floor but cannot classify the debris types.
Deep-learning-based computer vision is an emerging technique for automatic recognition and classification. Deep-learning architectures have many hidden layers of neurons, and the neural network architecture can be modified and optimized to solve different complex tasks in computer vision-based applications [10][11][12][13][14][15][16][17]. Chen et al. [10] proposed a computer vision-based robot grasping system for automatically sorting garbage, where a Fast RCNN is employed for detecting different objects in the scene. Deep-learning-based head-pose detection is proposed by Jaishankar et al. for wearable pet robots [11]. The authors use the AlexNet CNN architecture to implement the head-pose detection system. Sa et al. [12] proposed a CNN-based fruit-detection system for an autonomous agricultural robotic platform. Here, the Faster RCNN framework is used for detecting various kinds of fruit in the field. Saeed and Mathew developed a vision system for autonomous sewer robots using a five-layer CNN architecture. The developed CNN architecture is used to detect and characterize cracks in a pipeline [13]. Li et al. proposed a surface defect detection algorithm. Here, a Single-Shot MultiBox Detector (SSD) MobileNet network is employed to detect the surface defects [14]. Customized CNN architectures for debris classification and specific object-detection applications have been reported in [18,19]. However, customized CNN layers can heavily affect the object-detection accuracy and would not yield better results when two object classes exhibit very similar features. This can be overcome by combining the CNN with other models such as an SVM, which can solve simple classification problems such as classification based on size. This also provides better accuracy than a fixed threshold, while remaining lightweight, which enables real-time implementation. 
Recently, several works have combined the CNN and the SVM in different ways to resolve complex issues and improve object-detection and classification accuracy. One study carried out face detection using a CNN [20]. SVMs were integrated with this model using kernel combination. Another study applied this technique to scene recognition by using features from different layers of the CNN in the SVM [21]. The CNN-SVM classifier has also been used for recognizing handwritten digits [22]. In [23], Ucar et al. developed a hybrid object recognition and detection scheme for an autonomous driving vehicle by combining CNN and SVM techniques, where multiple CNNs are employed to determine local robust features and an SVM recognizes and detects all objects. Even though there exists much literature demonstrating the benefits of deep-learning techniques for various robotic applications, none of these works targets debris classification for indoor floor-cleaning robots using deep-learning techniques. In this work, we present a debris-detection and classification task for floor-cleaning robot applications using cascaded SSD MobileNet and SVM frameworks. By adjusting the network parameters and structure, the deep-learning network can recognize and classify the debris as solid or liquid. Furthermore, the SVM classifier [24][25][26] is used to classify the CNN-detected liquid spillage as smaller or larger regions. The experimental results demonstrate that the proposed cascaded scheme achieves better debris detection and classification capabilities than the Faster RCNN Inception architecture and achieves 97.5% classification accuracy. The CNN-SVM cascaded approach takes approximately 71 milliseconds for debris detection and classification on the captured floor image. This implies that the proposed technique can be implemented in real time for floor-cleaning robots to recognize debris and avoid messes that are hard to clean.
The remainder of this article is organized as follows: related work is reported in Section 2 and a brief introduction of CNN and SVM is given in Section 3. Section 4 describes the debris-detection and classification methodology in detail. The experimental results are given in Section 5. Finally, conclusions and future work are provided in Section 6.

Related Work
This section describes other literature on debris detection or classification. Both deep-learning and other techniques such as SVM have been used for garbage detection, but primarily only for outdoor situations, and most of these techniques consider only solid garbage. Gaurav et al. [27] developed a smartphone app to detect and localize outdoor garbage. The authors used a pre-trained AlexNet CNN model for detecting garbage from outdoor captured images, where Bing image search was used to collect the debris data set. The model has a classification accuracy of 87% for this application. However, their approach does not provide any information regarding the type of garbage and uses segmentation to detect regions, which is not very accurate. Another approach, provided in [19], uses an SVM with SIFT features to segregate garbage into recycling categories. This method uses cropped images of solid garbage only for classification but does not detect the location of the trash. The method has an accuracy of 94% for the given task. The OverFeat-GoogLeNet model was trained with debris present in outdoor environments by Rad et al. [28]. The authors used 18,672 images of various types of litter and waste to train the CNN to identify solid debris in outdoor environments, such as leaves, newspapers, food packages, cans, etc. The scheme achieved a precision of 68.27% for debris detection in this application. CNN architectures have also been used for underwater debris detection using FLS images [18]. This model was trained using images of common marine debris. Recently, Fulton et al. [15] evaluated various deep-learning frameworks for implementing marine debris detectors in autonomous underwater vehicles. The authors found that Faster RCNN and SSD have higher accuracy when compared to the YOLOv2 and Tiny YOLO frameworks. However, SSD requires higher processing times.

Convolutional Neural Networks (CNN)
CNNs are very common in the field of image classification. They achieve high accuracy in many classification tasks by automatically detecting the features in a dataset of images. CNNs are better compared to traditional methods, where the filters that enable feature extraction have to be specified before computation. A typical CNN consists of four layers: a Convolution layer, Pooling layer, ReLU layer and Fully Connected layer. Each layer and its function is described here briefly. The CNN architecture, in general, is shown in Figure 1. A convolution layer extracts features, a pooling layer combines these features and reduces dimension, ReLU layers remove negative values, and the fully connected layers are responsible for classification.
The object-detection CNN architecture used in this paper comprises two parts: a feature extractor and a bounding box predictor.

Feature Extractors
Feature extractors extract specialized features pertaining to a particular classification problem. These features are distinct for the given problem and can be used to distinguish between multiple classes. Specialized CNN architectures such as YOLO [29], MobileNet v2 [30], AlexNet [31], Inception v2 [32], and ResNet [33,34] are used for feature extraction [35]. MobileNet is a prominent architecture for real-time scenarios. It uses depthwise separable convolution layers, which have a reduced computation cost compared to standard convolution layers. MobileNet v2 is an improved version of the MobileNet architecture. This architecture is highly efficient and can be deployed in low-power real-time scenarios. Inception is a popular architecture for feature extraction. It has a good balance between accuracy and cost but is far costlier than MobileNet. The architecture was developed after VGG and is more accurate and computationally efficient. It employs convolution layer factorization, which splits larger convolutions into a combination of smaller ones to achieve this. Inception v2 is used for a comparison of performance on the given task. ResNet is a state-of-the-art feature extractor, which obtains high accuracy using residual learning. This improvement in accuracy comes from 'shortcuts', connections that skip layers. However, ResNets have a very high computation cost and cannot be employed in real-time systems.

Bounding Box Predictors
Bounding box predictors locate the objects in an image using feature maps extracted by the feature extractor. Two popular networks exist for this [35]-Faster RCNN [36] and SSD [37]. Faster RCNN is a commonly used algorithm for predicting bounding boxes. Faster RCNN is highly accurate. It uses a Region Proposal Network (RPN) to predict potential regions and a fully connected network for classification. This fully connected network uses both the feature map and the output of the RPN. Although Faster RCNN is efficient and accurate, it is not fast enough for real-time operations. SSD is more suited for real-time operations.

Support Vector Machine (SVM)
SVM is commonly used in classification problems [24][25][26]. SVM constructs a hyperplane as a separation between two or more classes. This hyperplane is constructed based on the distances of the points of the different classes from it. The standard form of a hyperplane is given in Equation (1):

w · x + b = 0 (1)

where w is a vector normal to the described hyperplane, referred to as the weight, and b represents the bias. For any data set containing points (x, y), with the class label y taking the two values +1 and −1, the hyperplane separating the two classes satisfies Equations (2) and (3):

w · x_i + b ≥ +1 for y_i = +1 (2)
w · x_i + b ≤ −1 for y_i = −1 (3)
However, this boundary may not be perfect. This can be due to the data not being completely separable. Hence, determining the hyperplane becomes a minimization problem where the hinge loss is minimized. Equation (4) describes the hinge loss function:

ℓ(x, y) = max(0, 1 − y(w · x + b)) (4)
Another scenario when the boundary may not be perfect is when the data is not linearly separable.
Here, kernel functions such as the Radial Basis Function (RBF), linear, polynomial, Gaussian, etc., can be employed for classification with SVM.
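A minimal sketch of the hinge loss and RBF kernel defined above (the data points here are illustrative, not from the paper's dataset):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Mean hinge loss max(0, 1 - y*(w.x + b)) over a labeled dataset."""
    margins = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins))

def rbf_kernel(x1, x2, sigma=1.0):
    """Radial basis function kernel exp(-||x1 - x2||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(hinge_loss(w, 0.0, X, y))  # 0.0: both points lie beyond the margin
print(rbf_kernel(np.zeros(2), np.zeros(2)))  # 1.0 for identical points
```

Points correctly classified with a margin of at least 1 contribute zero loss; misclassified or margin-violating points contribute linearly.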

Methodology
This section describes the enhanced deep-learning approach used for debris recognition and classification. Figure 2 illustrates the functional description of the proposed scheme. The entire design flow is divided into two phases: the training phase and the testing phase. The proposed scheme uses MobileNet v2 for feature extraction and SSD for detection and classification of the floor debris into solid and liquid spillage debris. Then an SVM is used to classify the liquid spillage debris regions as small or large based on spillage size. Typically, identifying the size of an object from the bounding box outputs of the CNN is one possible method, but it faces several drawbacks. Firstly, it relies on a hard-coded threshold that is set manually and is hard to determine accurately [18,19]. Secondly, finding the actual object size with the bounding box method requires very accurate intrinsic and extrinsic camera parameters, and camera calibration errors between detected objects in the world frame and in the image frame are difficult to address in dynamic environments. Using an SVM with the given training data, the input bounding box can be classified autonomously into two classes, i.e., hard-to-clean or easy-to-clean objects.

MobileNet V2 for Feature Extraction
MobileNet v2 is a lightweight feature extractor that can run in real time on low-power embedded systems. A significant change over standard CNN architectures is the use of depthwise separable convolution layers, which have a lower computational cost than standard convolution layers. Depthwise separable convolution works on the principle of factorization: the standard convolution layer is factorized into two layers, a depthwise convolution layer and a pointwise convolution layer. This combination drastically reduces the computational complexity of the traditional convolution layer. Inverted residual layers are also used. These layers are based on the structure of the convolution layers in the ResNet architecture, i.e., residual learning using 'shortcuts'. They improve the accuracy of the depthwise convolution layers without a large overhead. Bottleneck layers that reduce the input size are used as well, drastically reducing the computation time. An upper bound is applied to the ReLU layer, which serves to limit the overall complexity. The structure of the convolution layers used in MobileNet v2 is shown in Figure 3. Since the location of the debris is also required, a bounding box predictor is used in place of the fully connected layers present in the feature extractor.
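The saving from the factorization above can be sketched as a back-of-the-envelope multiply-accumulate count, a generic illustration of depthwise separable convolution rather than the paper's exact layer configuration:

```python
def standard_conv_cost(k, c_in, c_out, h, w):
    """Multiply-accumulates for a standard k x k convolution layer."""
    return k * k * c_in * c_out * h * w

def depthwise_separable_cost(k, c_in, c_out, h, w):
    """Depthwise k x k conv per input channel, then a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise

# Example: a 3x3 conv, 32 -> 64 channels, on a 112x112 feature map
std = standard_conv_cost(3, 32, 64, 112, 112)
sep = depthwise_separable_cost(3, 32, 64, 112, 112)
print(sep / std)  # 1/c_out + 1/k^2 = 73/576, roughly an 8x reduction
```

The ratio 1/c_out + 1/k² is why MobileNet-style layers stay cheap even as the channel count grows.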

SSD for Bounding Box Prediction
Bounding box predictors output the bounding box coordinates and the class type along with the confidence level of the classification. These coordinates can be translated into real-world coordinates to help the floor-cleaning robot avoid larger spillage debris. SSD is used in this paper due to its low computation cost, which allows it to be deployed in real-world systems such as small floor-cleaning robots. It classifies the debris into solid debris and liquid spillage.
SSD uses a CNN architecture to predict a set of bounding boxes over the region where a particular class is detected. These bounding boxes have a fixed size i.e., their sizes are selected from a set of predefined values. This modification helps to reduce the overall computation cost of these boxes. Although the boxes may not perfectly enclose the spill, the error is usually negligible. Since the boxes are of fixed sizes, the network employs multi-scale feature maps, which help to recognize objects of different scales. The network predicts a score along with the box coordinates, which indicates its confidence in the given prediction.
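The fit between a fixed-size default box and a detected region is conventionally measured by intersection-over-union; a minimal sketch of that standard criterion (generic formulation, not code from the paper):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes,
    used to score how well a fixed-size default box covers a detection."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1 over union 7, about 0.143
```

A small residual IoU gap is the "usually negligible" enclosure error described above.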
Since bounding boxes have both location and class type as parameters, the loss cannot be computed directly. A special measure must be used, which is a combination of these parameters. The overall loss in the prediction of bounding boxes is given in Equation (5):

L = (1/N)(L_conf + α · L_loc) (5)

where N is the number of boxes predicted and α is a weight parameter. L_conf is the SoftMax loss obtained from classification. L_loc is the L1 error obtained from the location estimation of each box.
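Equation (5) can be sketched directly; the per-box confidence and localization loss values below are hypothetical placeholders for the SoftMax and L1 terms:

```python
import numpy as np

def multibox_loss(conf_losses, loc_losses, alpha=1.0):
    """Combined detection loss L = (1/N) * (L_conf + alpha * L_loc),
    where N is the number of matched boxes; 0 if nothing matched."""
    n = len(conf_losses)
    if n == 0:
        return 0.0
    return (np.sum(conf_losses) + alpha * np.sum(loc_losses)) / n

# Hypothetical per-box losses for four matched default boxes.
print(multibox_loss([0.2, 0.1, 0.3, 0.4], [0.05, 0.1, 0.0, 0.05]))  # ~0.3
```

The weight α trades off classification confidence against localization accuracy during training.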

Training Phase
An Aver vision system and a Kinect v2 RGB-D sensor are used in real-world scenarios to capture images for training the network. Two viewpoints were taken into consideration: top view and robot view. About 2000 training images are captured on dynamic floor backgrounds with various classes of food debris, including solid food waste, semi-solids, and large liquid spillage. In addition, to improve the CNN learning rate and prevent over-fitting, data expansion is applied to the captured images. Data expansion increases the number of images in the dataset by using simple geometric transformations such as rotation, scaling, and flipping. Furthermore, to reduce the computing load in the training stage, the captured images are resized to 640 × 480 pixels. The dataset is balanced across the two classes, with around 900 images for each class. These images contain one or more debris instances of a single type (solid or liquid). The remaining images contain mixed debris, with both solid and liquid debris in the same image. On average, each image contains two debris instances. These images are labeled manually for training the network by drawing bounding boxes over the debris.
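A minimal sketch of the data-expansion step using simple geometric transforms (illustrative only; in practice the bounding-box labels must be transformed along with each image):

```python
import numpy as np

def expand_dataset(image):
    """Generate geometric variants of one image: 90-degree rotations,
    horizontal/vertical flips, and a crude 2x nearest-neighbour downscale."""
    variants = [image]
    for k in (1, 2, 3):                  # rotations by 90, 180, 270 degrees
        variants.append(np.rot90(image, k))
    variants.append(np.fliplr(image))    # horizontal flip
    variants.append(np.flipud(image))    # vertical flip
    variants.append(image[::2, ::2])     # crude 0.5x scaling
    return variants

img = np.arange(12).reshape(3, 4)        # stand-in for a grayscale image
print(len(expand_dataset(img)))          # 7 variants including the original
```

Each captured image thus yields several training samples at negligible cost, which helps prevent over-fitting on a dataset of this size.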
The grid search algorithm determines some of the hyperparameters of the CNN and the hyperparameter λ of the SVM. It tries different combinations of candidate values to find the optimal hyperparameters, and the combination that results in the highest accuracy is chosen. For the CNN, the initial learning rate is set to 0.002. The decay factor for RMS Prop is set to 0.9. A batch size of 32 is used and the number of epochs for training is 10,000. The other hyperparameters of the model are inherited from a pre-trained model (trained on the MS COCO dataset) [35]. Similarly, the hyperparameter λ for the SVM is obtained as 1.0.
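The grid search itself is simple to sketch; the evaluation function and candidate values below are hypothetical stand-ins for the actual training-and-validation runs:

```python
from itertools import product

def grid_search(train_eval, grid):
    """Exhaustively try every hyperparameter combination and keep the one
    with the highest validation accuracy returned by train_eval."""
    best_params, best_score = None, -1.0
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_eval(params)       # train a model, return its accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical evaluation: pretend accuracy peaks at lr=0.002, lambda=1.0.
def fake_eval(p):
    return 1.0 - abs(p["lr"] - 0.002) - abs(p["lam"] - 1.0) * 0.01

grid = {"lr": [0.0002, 0.002, 0.02], "lam": [0.1, 1.0, 10.0]}
print(grid_search(fake_eval, grid))
```

In practice each `train_eval` call is expensive (a full training run), so the grid is kept coarse.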
A batch size of 20 is used while training the CNN. The loss is optimized through the Root Mean Squared Propagation (RMS Prop) [38] algorithm with a decay factor of 0.9. RMS Prop uses the gradient g_t, the weight w_t, and an exponential average v_t of squared gradients at any time t to estimate the weight at time t + 1. Equation (8) is used to update the weights in the network:

v_t = β · v_{t−1} + (1 − β) · g_t², w_{t+1} = w_t − (η / (√v_t + ε)) · g_t (8)

Here, η is the initial learning rate, set to 0.002, β is the decay hyperparameter, and ε is a small constant introduced to prevent divide-by-zero errors.
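A minimal RMS Prop sketch consistent with the update described above (the standard formulation; the paper's exact variant may differ in detail, e.g. in where ε is added):

```python
import numpy as np

def rmsprop_step(w, g, v, eta=0.002, beta=0.9, eps=1e-8):
    """One RMSProp update: maintain an exponential average of squared
    gradients and scale the step by its square root."""
    v = beta * v + (1.0 - beta) * g ** 2
    w = w - eta * g / (np.sqrt(v) + eps)
    return w, v

w, v = np.array([1.0]), np.array([0.0])
for _ in range(5):
    w, v = rmsprop_step(w, g=2.0 * w, v=v)   # gradient of f(w) = w^2
print(w)  # moves toward the minimum at 0
```

Scaling by the running average of squared gradients gives each weight an effectively per-parameter learning rate.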
The network is implemented in the TensorFlow framework on Ubuntu 16.04. A system with the following hardware configuration is used for training and testing: Intel Xeon E5-1600 V4 CPU with 64 GB RAM and an NVIDIA Quadro P4000 GPU with 12 GB Video memory.
The computation of performance metrics involves the use of K-fold cross-validation. In this technique, the dataset is divided into K subsets with K−1 subsets used for training and the remaining subset for evaluating the performance. This process is repeated K times to get the mean accuracy and other performance metrics of the model. K-fold cross-validation is done to ensure that the figures reported are accurate and not biased towards a particular dataset split. This work uses 10-fold cross-validation. The images shown are obtained from the model with the highest accuracy.
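The K-fold split can be sketched as follows (a generic contiguous-fold split; the paper does not specify its exact partitioning scheme):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k near-equal folds; each fold serves once
    as the validation set while the remaining k-1 folds form the training set."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        splits.append((train, val))
        start += size
    return splits

splits = k_fold_indices(n=10, k=5)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 8 2
```

Averaging the metric over all k validation folds gives the reported cross-validated figure.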

SVM for Error Reduction and Spill Size-Based Classification
Not all liquid spillage debris is hard to clean. Even when a robot cannot pick up a piece of solid debris, an attempt to do so does not usually create a mess. The same is not the case with liquid spills, where a classifier based on the spill size is required to prevent a mess and damage to the cleaning equipment.
Here, an SVM is employed to classify the size of the detected liquid spillage debris. An SVM has higher accuracy than a fixed threshold, which may not accurately resolve borderline cases. Also, the SVM is computationally lightweight, which helps the real-time implementation of this scheme. A soft-margin SVM is used. The hyperplane boundary is modeled as the minimization problem shown in Equation (9):

min (1/2)‖w‖² + λ Σ_i max(0, 1 − y_i(w · x_i + b)) (9)

where the term ‖w‖² adjusts the margin size of the soft-margin SVM classifier. Also, since the data is not linearly separable, an RBF kernel is used. The RBF kernel K on two points x_1 and x_2 is given in Equation (10):

K(x_1, x_2) = exp(−‖x_1 − x_2‖² / (2σ²)) (10)
where σ is a free parameter. Two viewpoints are used in this paper for spill size detection, i.e., viewing from a top perspective and viewing from a robot's perspective. For both perspectives, three features (Area, Class, and Confidence) are used for this classification.

Area:
The area is used as a measure of the size of the spill. The area is estimated using the transformation given in Equation (11):

Area = Width_real × Height_real (11)
Width_real = Width_box × PPI (12)
Height_real = Height_box × PPI (13)

where Area is the estimated area, and Width_real and Height_real are the real width and height computed from Equations (12) and (13). These equations use the Pixels Per Inch (PPI) calibration, computed by Equation (14). [Algorithm 1, which assembles the feature vector from the predicted box area, class, and prediction.confidence, is not reproduced here.]
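A minimal sketch of the area feature following Equations (11)-(13), assuming the PPI term acts as a pixel-to-real-unit calibration factor; the pixel dimensions, calibration value, and class/confidence entries are invented for illustration:

```python
def estimated_area(width_px, height_px, ppi):
    """Estimate the real-world spill area from a bounding box, following
    Equations (11)-(13): scale each pixel dimension by the PPI-based
    calibration factor, then multiply the real width by the real height."""
    width_real = width_px * ppi
    height_real = height_px * ppi
    return width_real * height_real

# Hypothetical detection: a 120 x 90 pixel liquid-spill box with an
# assumed calibration factor of 0.05 real units per pixel.
area = estimated_area(120, 90, ppi=0.05)
features = [area, 1, 0.96]  # (Area, Class id, Confidence) fed to the SVM
print(area)
```

This (Area, Class, Confidence) triple is the feature vector the SVM separates into small and large spill regions.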

Experiments and Analysis
This section describes the experimental results of the proposed scheme. The proposed system should be able to detect and classify different types of debris. The scheme should also be able to classify liquid spillage debris based on their size. Hence, two experiments are conducted to validate this scheme. The first experiment is to assess the performance of debris classifier using real-world test images. The model has not been exposed to these images during the training phase. The second experiment is to assess the performance of liquid spill size classifier to avoid hard-to-clean debris.

Performance Metrics
Standard statistical measures are used to assess the performance of the proposed scheme. These include accuracy, precision, recall, and F-measure, which are computed as shown in Equations (15)-(18):

Accuracy(Acc) = (tp + tn) / (tp + fp + tn + fn) (15)
Precision(Pre) = tp / (tp + fp) (16)
Recall(Rec) = tp / (tp + fn) (17)
F-measure(F) = 2 · Pre · Rec / (Pre + Rec) (18)

Here, tp, fp, tn, and fn represent the true positives, false positives, true negatives, and false negatives, respectively, as per the standard confusion matrix. In Equations (19) and (20), η_miss accounts for the target debris not recognized by the network and η_false for other objects detected as debris.
Figures 4 and 5 show the typical detection results for various types of floor debris captured in both the top view and the robot view at different angles. Here, solid debris is marked by a green rectangle and liquid spill regions are marked by a sky-blue rectangle. The imaging system is mounted at a height of 150 cm from the floor for the top view and 60 cm for the robot perspective.
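Equations (15)-(18) can be computed directly from the confusion-matrix counts; the counts below are hypothetical:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-measure from confusion counts,
    per Equations (15)-(18)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Hypothetical confusion counts for one debris class.
print(classification_metrics(tp=90, fp=5, tn=85, fn=10))
```

F-measure, as the harmonic mean of precision and recall, penalizes a classifier that trades one heavily for the other.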

Debris Classification
The experimental results show that the trained SSD MobileNet architecture can detect most of the solid and liquid spill debris (97-98%) on the floor. Detected solid debris typically has a confidence level of 98% or higher, and liquid spillage debris a confidence level of 96% or higher. The overall accuracy of this classifier for both perspectives is shown in Table 1. Furthermore, to validate the robustness of the proposed scheme, debris classification has been applied to regions where a mixture of solid and liquid debris is present. The performance of the scheme in these scenarios is shown in Figure 6. The classification results indicate that the trained SSD MobileNet architecture can determine the solid and liquid debris present in these scenarios very accurately.

Liquid Spill Size Classification
This experiment is performed with liquid spillage of varying compositions. Spillages made of everyday food products such as eggs, yogurt, milk, etc. are spread on the test bed in various sizes and captured by the Aver Visualizer Module [39]. The Visualizer Module comprises a 5 megapixel (MP), 180° field-of-view CMOS camera sensor with a flexible arm for adjusting the camera height and tilting it from −90° to 90°, as shown in Figure 7.
In our experiment, floor images are captured at 60 cm from the ground with the camera head inclined at an angle of 45 degrees downwards for the robot perspective (Figure 7a) and 90 degrees downwards (Figure 7b) for the top perspective. The focused area has been measured accurately through a laser marker on board the vision system, which helps to compute and verify the pixel-based bounding box area estimation. Then, the captured images are sent to the CNN architecture for debris classification. Finally, the detected liquid spillage debris regions are evaluated by an SVM to classify the liquid spillage debris into large and small regions based on size. Figure 8 shows the SVM classification results for various sizes of liquid spillage debris. A violet bounding box indicates large liquid spillage debris and a red bounding box indicates small liquid spillage debris. For solid debris, the threshold is determined according to the size of the vacuuming inlet of the cleaning device. The standard inlet sizes of commercial floor-cleaning robots are reported in Table 2. According to the vacuuming inlet dimension, we can fix the threshold for solid objects to determine whether the detected objects should be avoided or cleaned. Furthermore, solid debris taller than the ground clearance height (the gap between the robot base and the ground) of the robot is considered an obstacle, which can be detected by the bump sensor present in the robot and is automatically excluded from cleaning. Table 2. Inlet size of some common cleaning robots.

Comparison of Performance of Different Architectures
To assess the efficiency of our proposed scheme, the classification accuracy of the proposed scheme has been compared with more robust networks such as Faster RCNN ResNet and Faster RCNN Inception CNN architectures. Both networks have been trained using the same dataset for a similar amount of time.
The metrics indicate that Faster RCNN ResNet performs best among the compared models. Faster RCNN Inception generally performs better than SSD MobileNet, but not in the debris type classification scenario. In contrast to the other schemes, SSD MobileNet uses significantly less time per image. Also, SSD MobileNet is optimized for low-power hardware, which makes it possible to deploy the network in real-time situations using low-power computing hardware such as the Intel Movidius Neural Compute Stick and the Raspberry Pi. Hence, the proposed scheme can be implemented on a small cleaning robot for real-time debris detection and classification.
The garbage detection and classification results for Faster RCNN ResNet, Faster RCNN Inception, and the proposed scheme are shown in Table 3. The comparison is made using the standard metrics discussed above. Also, a timing analysis has been performed to estimate the average detection time of each network; it has been tested on an NVIDIA Quadro P4000 graphics card and is reported in Table 4. The comparison of the prediction results of the three schemes on two given images is shown in Figure 9.

Comparison with Existing Schemes
This section analyzes the differences between the proposed debris-detection scheme and other object-detection applications that use deep-learning techniques. Table 5 shows the differences between the proposed implementation and existing CNN-based debris-detection and classification schemes. The differences are listed based on CNN architecture, network hyperparameters, and object-detection accuracy. Faster RCNN architectures have higher accuracy than SSD for debris detection and other object-detection scenarios across different feature extractors. However, SSD is faster because it uses a predefined set of bounding-box sizes and the more efficient depthwise convolution layers. These contribute to a lower computation cost but result in a reduction of accuracy. YOLO-based networks are faster than SSD but have poor accuracy in most tasks. A comparison of the performance of these networks is shown in Table 6. SSD MobileNet provides a good balance between accuracy and computation cost and is fast enough to be used in real-time frameworks for object detection, and specifically debris classification. A comparison with non-deep-learning-based approaches is shown in Table 7. Deep-learning techniques offer a large advantage when sufficient data is available. These techniques can autonomously extract features from images, which allows them to learn features and patterns that are difficult to identify statistically. CNN architectures are particularly good at extracting features from images, since they exploit combinations of neighboring pixels.

Conclusions and Future Work
This work proposed debris detection and size-based liquid spillage debris classification for floor-cleaning robot applications using cascaded machine-learning algorithms involving the SSD MobileNet deep-learning framework and an SVM classifier. The feasibility of the proposed method was verified with real debris images captured from the top perspective and the robot perspective. The experimental results show that the proposed scheme can identify and classify both solid and liquid spillage debris with more than 96% accuracy in both perspectives and achieves better detection and classification performance than existing deep-learning schemes and other computer vision-based debris classification techniques. Furthermore, for deployment on a real floor-cleaning robotic platform, the robot perspective has added advantages over the top perspective. Therefore, in most vision-enabled floor-cleaning devices, the vision modules are fixed in the robot perspective and can cover a large floor area (approximately 400 cm of coverage from the current position). These factors make the robot perspective the better approach for spill detection, and it can be easily deployed on any specific existing or new platform.
As the proposed technique has the potential to be implemented in real-world applications, we plan to deploy this scheme in reconfigurable robots being developed at SUTD, Singapore [45,46]. Furthermore, we plan to bridge the vSLAM algorithm with the proposed scheme to resolve the occlusion issues in real-time implementation on floor-cleaning robots. Through vSLAM techniques, robots can compute the debris location and update the debris location in vision-based mapping systems, which helps track the detected object even if it is occluded by another object.