Vision Based Wall Following Framework: A Case Study With HSR Robot for Cleaning Application

Periodic cleaning of all frequently touched social areas such as walls, doors, locks, handles, windows has become the first line of defense against all infectious diseases. Among those, cleaning of large wall areas manually is always tedious, time-consuming, and astounding task. Although numerous cleaning companies are interested in deploying robotic cleaning solutions, they are mostly not addressing wall cleaning. To this end, we are proposing a new vision-based wall following framework that acts as an add-on for any professional robotic platform to perform wall cleaning. The proposed framework uses Deep Learning (DL) framework to visually detect, classify, and segment the wall/floor surface and instructs the robot to wall follow to execute the cleaning task. Also, we summarized the system architecture of Toyota Human Support Robot (HSR), which has been used as our testing platform. We evaluated the performance of the proposed framework on HSR robot under various defined scenarios. Our experimental results indicate that the proposed framework could successfully classify and segment the wall/floor surface and also detect the obstacle on wall and floor with high detection accuracy and demonstrates a robust behavior of wall following.


Introduction
Cleaning of high-touch areas in a physical environment is very critical in preventing the spread of infectious diseases. One of the recent surveys on COVID-19 spread shows that the common mode of infection spread is by touching infected surfaces [1]. The frequently touched areas that include walls, doors, locks, handles, windows need to be cleaned within a fixed time interval. Among those, cleaning of large wall surfaces manually is always seen as a tedious, time consuming, and astounding task. Also, during the period of spread, there is a high possibility and risk where a human cleaner may get infected. To overcome such bottlenecks, and to reduce the risk of infection transmission, the cleaning contractors are encouraged to use robotic solutions by every developed nation. The use of robotic cleaners for professional cleaning services has been increased gradually in the past few years. Recent research done by the robotic industrial association states that more than 3000 industrial cleaning robots will be sold between 2019 and 2021. Successful market players include avid bot, eco bot, lions bot, Adlatus robotics and so on. Even though there are numerous market available robotic platforms which are considered as a replacement for the professional cleaners, most of these systems are concentrated only in floor cleaning, and not considered the wall cleaning. In the past two decades, numerous wall cleaning robots were researched with the wall climbing capability.
For instance, robots that climb vertical surfaces by vacuum-suction to a surface were created for the very purpose of wall cleaning. These adhesive robots are, however, known to carry the risk feature pyramid based hybrid CNN network was proposed by Li et al. [13] for deploying the obstacle avoidance system in unmanned surface vehicles. The Hybrid CNN network was built by combining the ResNet and DenseNet CNN framework and gained 93.80 mean Average Precision (mAP) for testing with the USVD2018 data-set.
Besides, DL algorithms are widely used in cleaning robots for trash detection, path planning and ARM manipulation, etc. In [14] author uses the deep learning framework in cleaning robot for recognizing and localizing the trashes from the robot vision module, where the author used deep CNN based semantic segment framework SegNet for segmenting out the ground region from other areas and ResNet for object detection and localization. The author reported that the combined approach obtained 96% trash detection accuracy and took 10.3 ms for ground segmentation and 8.1 ms for trash recognition. DL based object detection algorithms include SSD MobileNet, Faster RcNN ResNet and Yolo v2 are used to evaluate the trash and stain detection for cleaning robot application [15]. The author report that SSD MobileNet is an optimal algorithm for cleaning robot trash and stain detection, which provides a good balance between accuracy and computation cost. A deep learning architecture based autonomous cleaning robot was developed with learning ability to generate 2D trajectories from a human kinesthetic samples Jia Yin et al. [16,17]. Ref. [18] proposed a deep-learning framework for table cleaning tasks using Human support Robot. The authors used a 16 layer CNN framework build on the darknet framework. The developed model detects the food stain and food trash on the table with 95% detection accuracy. Taking account of the above facts, in this paper, we are proposing a novel vision-based wall cleaning framework that acts as an add-on for any professional robotic platform to perform wall only cleaning.
The proposed framework is constructed with two deep learning framework to visually detect and segment the Wall/floor area and detect the floor obstacle and wall objects usually embedded on the surface such as doors, pipes, racks, switchboards, fire extinguishers and so on. Once the features on the walls are extracted, the system process the feature using a visual servoing technique to execute the robot wall following. The major challenges encountered during the development of the proposed scheme were its training of large data set, the integration of visual servoing techniques, autonomous operation and translating the theoretical design into a physical Human Support Robot (HSR) robot system. All these aspects are detailed in this paper, and concluded with the experimental results that validate the proposed scheme and its ability to wall follow with the HSR robot under three different scenarios. The proposed scheme is first of its kind that can perform visual based wall only following. The proposed scheme is an initial design towards the autonomous wall-following technique that can perform multiple tasks on a wall surface such as cleaning, painting, crack inspection, and so on. This paper is organized as follows: Introduction, motivation, and literature review in Section 1, Section 2 discusses the proposed system, and Section 3 discusses HSR configuration. Experimental results are detailed in Section 4. Finally, Section 5 concludes this research work.

Vision Based Wall Following and Obstacle Avoidance Framework
The main objective of our proposed system is to use a mobile robot to autonomously follow the wall for cleaning the wall surfaces that have been frequently touched. Figure 1 shows the flow diagram of the proposed system that enables the wall following ability of any mobile robot. In this paper, we utilized Toyota Human Support Robot as a case study to evaluate the scheme. Algorithm 1 shows the execution flow of the wall following and obstacle avoidance system. The first step of our proposed method is to segment the wall and floor surface from the image that was captured from the front camera of the robot by utilizing the semantic segmentation CNN model, FCN8 [19]. Once the segmentation is completed, we separate the wall surface and floor surface by applying the mask on the segmented wall and floor region on the segmented image and passed it to the next stage for further processing. In every wall and floor surface, there are few objects that need to be avoided that include obstacles on nearby wall and objects fixed in the wall include door, switch boxes, pillars, fire extinguishers, sandal racks, and so on. To identify such objects, we adopt SSD MobileNet object detection framework [20][21][22] as a next step. These objects will be identified from the masked segmented wall and floor images, which is used as primary information by the system to decide whether to follow the wall or not. If there is no object detected nearby the wall, the visual servoing technique gets activated to generate the trajectory to follow the wall. The generated trajectory is then subscribed by the velocity controller to produce a motor primitive that executes the locomotion. The detail of each module is described as follows.

Fully Convolutional Network for Wall Segmentation
In this work, FCN8 is used for segmenting out the wall and floor region from captured image [19,23]. It is the first semantic segmentation based CNN framework developed by Shelhamer et al. [19]. Figure 2 shows the internal architecture of FCN8. It comprises three key components including convolution network and deconvolution network and SoftMax probability function for finding the pixel class. FCN8 adopts VGG16 architecture for the convolution network, which contains 15 convolution layers, five 2 × 2 max-pooling layer and stride function 2 between convolutions layers. In the convolution network, each pooling layer down-sample the input by a factor of 2 both horizontally and vertically. In the fifth pooling layer, the input image is downscaled by a factor of 32. The deconvolution network is placed after the last fully connected layer of the VGG16 network. The deconvolution layer performs the up-sampling task (make the output equal to the size of input image ) using unpooling operation. To overcome the spatial information loss by convolution network, layer 4 and 3 of VGG16 was connected with layer 9 and 10 of deconvolution network and upsampled for 2 times to match with dimensions of layer 4 and 3. Finally, for generating the full-resolution segmentation map similar to input image size, the FCN Layer-10 is upsampled 4 times in FCN layer 11 using following parameters 16 kernel, stride = (8,8) and paddding ('same').

Obstacle Detection Framework
Obstacle detection is a key function in our proposed system. It helps the robot to avoid the obstacles nearby wall and also helps to prevent the cleaning module collision with objects fitted in the wall while performing the cleaning task. To perform obstacle and object detection, a deep-learning based object detection framework is adopted in this work. Generally, an object detection framework is two types, one-shot detector and two-shot detector. The one-shot detector is quite fast contrast with two-shot approach algorithms such as Regional proposal networks (RPN). The single-shot algorithm requires one shot to detect multiple objects inside the image, while the RPN series (Fast RCNN, Faster RCNN) required two shots, one for producing the region proposals and another for detecting the object from generated proposals. Moreover, SSD is a lightweight framework widely used in mobile robot applications and also run in low power embedded devices. Hence we adopt the SSD framework for the obstacle detection task. In this work, SSD is combined with MobilNet feature extractor instead of VGG16 feature extractor to detect the obstacle and objects in real-time.
The Figure 3 shows an overview of the SSD MobileNet object detection framework. Here, SSD is run on top of the MobileNet framework and performs the object localization task, which uses the MobileNet as a feature extractor. In SSD, VGG-16 is a default feature extractor and classification network. However, in this work, VGG-16 is replaced by the MobileNet feature extractor, which is introduced by google. MobileNet is an optimized deep neural network architecture that will run very efficiently on mobile devices and low computing CPU. In MobileNet framework, conventional convolutional layers are modified by depthwise separable convolution layers and pointwise convolution function. Here, the depth wise functions perform deep convolutional operations using 3 × 3 kernels and pointwise convolutional are common convolutional layers that use the 1 × 1 kernels. Further batch normalization and ReLU 6 is applied to each convolutional result. The batch normalization functions fine-tune the data by setting the learning parameter, adjusting the learning rate, dropout ratio and restricts the gradient disappearance. The combination of deep wise and point wise convolutions structure activate the MobileNet to speed up the learning process and also reduce the computation time. Finally, the model has a residual connection, which gives the architecture a better accuracy compared to non-residual architectures.
In our work, MobileNet v2 is adopted for feature extraction, which is an updated version on MobileNet V1. The Figure 4 shows the difference between the MobileNet V1 and MobileNet V2 architecture, where depthwise separable convolutional blocks and pointwise convolutional layers similar to V1. However, there is a slight change in the structure of the convolution layers. Here, the first layer is 1 × 1 convolution with ReLu6 and the second layer performs the depth-wise convolution operation and third layer does the 1 × 1 convolution without any non-linearity. Further, the residual layer structure of MobileNet v2 is modified based on the structure of ResNet architecture, which helps to improve the accuracy of depthwise convolution layer without having large overhead. Bottleneck layers that reduce input size are also used. An upper bound is applied to the ReLU layer, which limits the overall complexity. To carry out the obstacle and object detection task, SSD replaces the final few fully connected layers in MobileNet with additional convolution layers to localize and detect objects using the feature maps. It includes the pyramid structured multiple auxiliary convolutional layers on top of MobileNet to extract the feature maps at different resolutions. In SSD, six output feature maps are used for detecting the class of an object and localizing the form of a bounding box where first two (19 × 19) are taken from MobileNet feature extractor and remaining feature maps (10 × 10, 5 × 5, 3 × 3, 1 × 1) are generated by auxiliary convolution layers.
While training, the loss for each prediction is computed as a combination of the confidence loss L con f idence and location loss L location (Equation (1)). The confidence loss is the error in the prediction of class and confidence. The location loss is the squared distance between the coordinates of the prediction. A parameter α is used to balance the two losses and their influence on the overall loss. This loss is optimized through the use of the Root Mean Squared gradient descent algorithm [24]. This algorithm computes the weights w t at any time t using the gradient of loss L, g t and gradient of the gradient v t (Equations (2)-(4)). Hyperparameter β, η are used to balance the terms used for momentum and gradient estimation, while is a small value close to zero for preventing divide by zero errors.

Visual Servoing
The main objective of the visual servo scheme is to design a velocity controller for a robotic system that minimizes the error e generated between a chosen visual features s and its desired values s * in the image frames [25]. Since our aim is to design a control function that could curtail the error value e = [s − s * ] T , the features need to be selected that can be visually traceable by the vision system. Also, it is critical to choose the right feature with respect to the application. In our case, we are following the wall with a particular distance in order to clean the wall surface. So, the first feature that we chose was θ rm made by the z-axis of the robot's vision system with respect to the wall/floor separation line of the corridor. This feature will be zero when the robot is looking forward and the camera is parallel to the wall surface. The x-coordinate x ob of the center point in the detected object ob t = (x ob , y ob ) at the time instant t is chosen as second feature. Both the features could be extracted from the output images of the CNN network. Figure 5 illustrates the operational flow of our wall following algorithm. The algorithm initially checks the non wall features in the output images from the CNN. According to the findings, the algorithm decides whether to follow the wall or to avoid them. During the wall following operation, the system first took the feature information and navigates in consonance with the relative position and orientation with respect to the separation line. While navigating, the path has been generated up-to the goal point P g which is calculated from the constant look up distance L from the center of the camera. For every new frame, the P g will be increased for the L distance according to the current position of the robot. The relative position is calculated from the robot starting point and the off-set distance D L which will be used as a gap that needs to be maintained from the wall. The relative orientation is calculated as deviation from the detected feature θ rm i.e., θ − θ rm , shown in Figure 6. Based on the kinematic model of the desired robot and with the shown model, we can able to generate the velocity commands for the robot. As mentioned earlier, the key distance value that our system needs to maintain to achieve the wall following operation is D L . These values will always be constant unless there is an obstacle detected in the frame. When the obstacle is detected, the D L will be increased and the value will be calculated not from the wall separation point but from the center point of the second detected feature x ob . According to the selected feature, the feature error was generated, which aids in producing the velocity commands in the velocity controller. The velocity values are passed to the robot controller unit which converts the commands to appropriate motor RPM with reference to the robot's kinematics. The robots wall following motions are generated which will update the image frames and its features. The feature errors are again calculated and the same process is continued till the robot completes the task. The feedback loop of the proposed system is shown in the Figure 7.

HSR Robot
In this article, we utilized Toyota HSR as our experimental platform to validate the proposed system. This section describes the HSR robotic architecture which provides an overall view of the implementation of the proposed scheme, autonomous navigation, and the perceptional components that aids in system execution. HSR is a mobile manipulator platform developed by Toyota which consists of a set of different sensor and actuator modules. The robot is equipped with an RGB-D camera (specification on Table 1) on top of its head with a display, manipulator with four degrees of freedom, mobile base module, and system body module which consist all computational equipment shown in Figure 8. The system runs with the Robot operating system (ROS). All operations of wall following runs under the ROS architecture. The ROS system enables the connection and linkage between the perceptional modules and the execution modules. All positional transformation of each module is pre-linked in the system which gives better localization values. The positional transformation all done in the 3D space wich will be used in our susyem to estimate the selcted feature. Figure 9 shows the schematic representation of HSR hardware architecture configuration, which comprises of two computing units, namely primary computing, and secondary computing. The first computing system is where we run our primary OS ubuntu within the ROS system will be functioning. This computing system takes the authority to communicate between each controllers, drivers that runs sensors and actuators. Also, this computing device is taking the lead manages all types of communication between the user. The display of HSR is connected with the primary computing system internally to monitor the process.   The second computing system is meant to take care of all vision system processes. Nvidia's Jetson TK1 board (GPU) is used in the secondary computing unit to execute the vision system task. This system acts as a slave for the primary computing device, so ROS also consider this system as a slave device. The primary function of this system is to compute all the received images and convert the information into ROS message format and sends back to the master system. Also, this system is responsible for receiving the ROS messages and decodes to pass it to the deep learning framework. Both the primary and the secondary systems are connected over Transmission Control Protocol/Internet Protocol (TCP/IP) and share all the ROS topics.
To execute the wall following task, the primary computing system activates the ROS connection and establishes the communication between each module. Once the connection is established, the ROS system takes over to execute the wall following function. The first step that the system performs is to activate the RGB-D camera and convert the video stream into separate image frames to pass it to the secondary computing for processing. The secondary system runs the FCN8 framework and SSD MobileNet framework to perform the segmentation process and object detected process respectively. The processed images are passed back to the ros layer as appropriate ros messages.
It is critical for the robot to properly localize itself for a robot to perform error-free navigation. The HSR is built with various sensors that can be used as and fused to perform localization. In our case, we fused various sensors information such as 2D lidar, camera, wheel encoder, and the IMU to identify the current location. We used traditional SLAM [26,27], and localisation [28] techniques to perform the localisation. We utilized preexisting ROS packages to run on the ROS networks. Once the robot is activated the ROS system starts running the localization module to get the global position of the robot. Eventually, the ROS layer sends the request to the secondary computing device to share the feature information. Once these two information's are available, the system calculates the relative position as explained in the previous section. With respect to the field of view from the camera, the goal point has been defined which generates the path and velocity commands.
The velocity commands are subscribed by the velocity controller which produces the motor primitives. The HSR motion is designed under the principle of differential drive, so the controller generates two main motor frequency L pwm and R pwm . The HSR equipped with the 24 V DC motor that drives the robot. These motors are connected with the motor driver which is controlled by the arm controller that communicates with the robot's primary computing system. Modification of the HSR robot needs to be carried out by attaching the cleaning payload on both sides to execute the wall cleaning task. The cleaning payload consists of a rotating mechanical brush that cleans high touch areas on the walls. The rotating mechanism works by securing a motor to the brush. When power is supplied, it will allow the brush to rotate. The CAD model of the modified HSR robot is shown in Figure 10. However, this paper focuses on the successful completion of the wall following without the cleaning payload. The Figure 10 is provided to give an overview idea of our proposed system's end application.

Experimentation and Results
This section describes the experimental procedure and results obtained from the proposed system. The experiment was designed with three steps: data-set preparation and training the deep learning frameworks, evaluating the trained model in offline using indoor data-sets and testing with HSR robot platform. Here, the segmentation model performance was evaluated with pixel classification accuracy, Intersection Over Union (IoU) and F1 score metrics [19] and obstacle detection model was assessed through precision (Equation (5)), recall (Equation (6)) and F1 score metrics (Equation (7)) .

Precision(Prec
Here, tp, f p, tn, f n represents the true positives, false positives, true negatives and false negatives respectively as per the standard confusion matrix.

Data-Set Preparation and Training
The data-set preparation process involves collecting indoor images with different obstacles and different wall and floor background. We collect the data-set from existing indoor image database through online [29][30][31][32][33][34][35][36] for both corridors and the obstacle images. However, some obstacles data is not clear and not from the perspective of the robot. So we used the robot camera to capture a few images of the obstacles from the perspective of HSR robot. The specification of the RGB-D camera is given Table 2. In total, there are 8000 images collected, which contains various types of obstacles that include fire extinguisher, photo frame, signboard, wall ornaments (wall clock, decorative lights), safety box, switch box, furniture (chair and rack), window, door, pipes, dustbin and plant pot. Some sample images were used for training the segmentation, and object detection network is shown in Figure 11, and Figure 12 respectively.  To improve the network learning rate and controlling the over-fitting, the data expansion methods are applied in collected images. Here, image scaling, rotation and flipping method are used to increase the number of sample images in the data-set. Further, in our experiment, the image size of (640 × 480) resolution was used for both training and testing. The collected image data set were separated into for FCN and the SSD Mobile Network. The corridor's data sets were trained for the segmentation process in FCN. On the other hand, we used the obstacle image data set to train the SSD Mobile Network. The two CNN models (FCN and SSD MobileNet) was developed in Tensor-flow 1.9 Ubuntu 18.04 version and trained using following hardware configuration Intel core i7-8700k, 64 GB RAM and Nvidia GeForce GTX 1080 Ti Graphics Cards. Figure 13 shows the results obtained from the training stage for the object detection model. The training was run 25,000 iterations and tracks the loss functions like classification loss, localization loss and total loss. In this process, we observe that loss was minimized each iteration. In the starting stage, the classification loss was above 4.4, the localization loss is above 1.2 and the total loss was above 8 after 25,000 iteration the total loss,classification loss and localization loss was gradually reduce and reach bellow 3.2, 2.4 and 0.6. The graph result shows that loss is decreasing over iteration and loss fluctuation is stable at 25,000 iterations, which indicate that the model converges and is trained properly.

Optimization
The two networks are trained with mini-batch SGD algorithm and network hyperparameters are obtained from a pre-trained model VGG16 [37] and SSD MobileNet trained by Ms COCO data-set [38]. The networks were fine-tuned with a batch size of 32 and trained for 10,000 epochs. The network uses the learning rate of 0.0002 and a momentum of 0.9, respectively. Further, a k-fold cross-validation procedure is used to evaluate the training image data-set. In this scheme, images are split into k groups and take k-1 for training the network, and the remaining one set is used for testing. In our work, we use 10 fold cross-validation scheme. The experimental results images shown are obtained from the model with good accuracy.

Offline Test with Indoor Data-Sets Images
To carry out the offline test, the models inference graph is configured into HSR and tested with four indoor image database include NAVVIS indoor database [33], Quattoni et al. indoor scene database [34], shih et al. 360-indoor dataset [35] and Adhikari et al. indoor object detection dataset [36]. To verify the robustness of the system, the test images are chosen with different wall and floor colors and images with different lighting conditions. The object fitted in the wall and located nearby wall are only considered as an obstacle. There are 1200 images (consider each class 50 count) chosen for the offline experiment. The segmentation and detection results of the indoor image database are shown in Figure 14a-i and Figure 15a-i detection, and performance metrics results are reported in Table 2. The experimental results indicate that the segmentation framework successfully differentiates the floor and wall from complex test images includes different lighting conditions and different wall and floor color and obtained pixel accuracy is above 90%, IoU average of 91.08% and average F1 score of 90.93% respectively. In this experiment, we observe that the fall of detection accuracy in some classes is due to false classification (ex: false classification between signboard and photo frame) and miss detection due to the tiny size of objects. ( Ex: safety box, switch box). However, this will not heavily affect the functionality of the robot. Further, the table result Table 2 indicate that the accuracy of SSD MobileNet frame for obstacle detection is above 90% and detects the obstacles on the wall and nearby wall with above 92% confident level. The offline experimental result proves that the proposed system is more suitable for deploying in HSR for real-time obstacle detection and avoidance.

Real-Time Test
The Real-time test was performed with HSR robot under various scenarios that have a straight corridor that exists with few obstacles along the wall surface. Before started the experiment, the robot was switched on and let the robot to reboot the internal system for few minutes. After the successful reboot of the robot, the robot will undergo a self-calibration mode, which calibrates the head position, and arm position to go back to its initial position. Then the robot was drove manually using the joystick controller to the testing spot. The user specify the wall direction (left or right wall) of which the robot is going to follow. Once the robot is positioned properly from the starting point, the autonomous wall following programming will be activated. The robot will start the wall following and avoidance process for a defined distance. The end distance will be varied according to the corridor chosen and its length. In the first scenario, we tested the robot on a straight wall with a length of 2 m shown in Figure 16(top left). In the second scenario, we chosen a straight wall with a distance of 1.7 m Figure 16(top right). In the final scenario, the corridor is chosen for a distance of 2.3 m Figure 16(bottom). In all considered scenarios, there are unique obstacles presented along with the wall surface. Our aim behind conducting this experiment is whether the robot could able to avoid those obstacles with the proposed scheme and only follows the wall. In the coming section, we have discussed on the results which we separated as neural network results and wall following results which is an outcome of visual servoing.
The experimental trials were initiated by manually driving the robot to test scenario 1 starting point. Then the autonomous wall following scheme was initiated. Once the program starts executing, the camera got activated to publish the image frames to the secondary computing device. The trained model of FCN8 that is running on the secondary system starts to segment the images as floor and wall. The Figure 17(left) shows some of the segmented images during the experiment-I trials. The result demonstrates the ability of the system to successfully segment the image frame as wall/floor and could identify the separation line. In this case, the wall/floor segmentation has obtained high pixel-wise accuracy of 89.2 %. Further, the output image is again processed for detects the obstacles that present along the wall surface. Figure 18(left) shows the output of the SSD mobileNet that shows the detected obstacle on the segmented wall surface, in a closed-door scenario. For such obstacle detection, the system has detected the obstacle with 95% confidence level.   Object detection result during trials in scenario-1 (left), scenario-2 (middle), scenario-3 (right).
In the second set of experiments, the robot is placed in the starting point of scenario 2. We followed the same procedure to perform the wall following task. In scenario 2, the illumination was quite dimmer than the previous scenario as you can see in the Figure 16(top right) However, total successful wall/floor segmented frames that the proposed system obtained 88.5% pixel-wise accuracy. Further, the obstacle detection algorithm detects the obstacle (PVC pipe) on a segmented region with 90% confidence level. Figures 17 and 18(middle) illustrates the outcome of the segmented images and the obstacle detection on the segmented wall surface respectively. Corresponding to the trial in scenario 3, the procedures were pursued as the previous trials. In this case, the obstacle that we considered was a shoe rack. In this trial, the illumination is much higher than the first scenario. Figure 17(right) depicts the results that were generated from the FCN. Throughout the trial, the total percentage of the number of frames that has been successfully segmented with 89.5% pixel-wise classification accuracy. Regarding the obstacle detection the Figure 18(right) shows the obstacle detected frame on the segmented wall surface. In this case study, the object detection algorithm detects the outdoor obstacle (shoe rack) with 93% confidence level. In this analysis, we observe that there are some miss classification in the segmentation process. This miss detection was encountered due to the similarities in the features information between floor and wall, sensor noise, and perspective viewpoint. We will consider using few depth data directly for the segmentation and identification process in the future works to reduce the noise and improve the accuracy. Tables 3 and 4 summarise the performance metrics result for offline and real-time case study of segmentation and obstacle detection results. The table results Table 3 indicate that the trained FCN model obtained an average of (offline and real-time) 89.47% pixel classification accuracy, 89.97% mean IoU, and 89.84% F1 score respectively. On the other-hand SSD MobileNet detects the obstacle on segmented image region with an average precision of 91.87% , recall, F1 score and accuracy of 90.77 and 91.01, 89.44 respectively. The experimental results show that the performance of segmentation and obstacle detection framework efficiency is stable for both offline and real-time test.

Comparison Analysis of Different Architectures
The performance of the segmentation model and object detection framework was compared with other popular segmentation and object detection framework. In this analysis, FCN-AlexNet, FCN16 framework, was considered to compare the performance of the FCN8 segmentation framework. Similarly, Faster RCNN ResNet and Faster RCNN Inception and YOLO V2 frameworks are used for comparison with the SSD MobileNet object detection model. The models are trained using the same data-set and a similar amount of time and tested in NVIDIA GPU cards with 1200 test images used in the offline experiment. Tables 5 and 6 shows the comparison analysis of different architecture. The Table 5 results indicate that FCN8 obtained better pixel-wise classification accuracy than FCN AlexNet and FCN16. Furthermore, the segmentation accuracy of FCN AlexNet is very low for object class with small pixel areas. Further, object detection comparison results indicate that Faster RCNN ResNet and Faster RCNN Inception model have better performance than SSD MobileNet and Yolo framework. However, those model computation time is quite high. Those uses Regional Proposal Network (RPN) require two-shot to detect the multiple objects in an image (one for generating the region proposal and one for detecting the object). Hence its computation time is quite high compare to one-shot detection scheme SSD and Yolo, and hard to run in low power embedded platforms and mobile devices in real-time. YOLO based object detection models are faster compared to SSD and RCNN series, but detection accuracy is poor, specifically small object detection. SSD-MobileNet shows a good balance between accuracy and computation cost and its inference time is quite fast enough to be used in real-time object detection and obstacle detection task and specifically in mobile robot platform.

Evaluation of Wall Following and Avoidance
We evaluated the wall following the efficiency of the robot under the same set of scenarios. The wall following algorithm was running in the primary computing device. Once the wall/floor separation process is done the information of the separation line is passed to the primary device which runs the visual servoing scheme. The servoing scheme computes and generates the velocity commands with respect to the D L value. Since we are considering the proposed scheme for the wall cleaning purpose, the radius of the rotating cleaning brush is taken as a D L . So in this case the D L value is taken as 20 cm. With the defined D L value, the robot follows the wall which is visualized in terms of path on the map as shown in Figure 19(top left). From the image, it is clear that the robot can successfully follow the wall. Figure 20(top left) shows the series of images taken from the robot's camera embedded with timestamps of each frame. The door is actually detected at frame number four with time 00:19. The graph that depicts the time and D L relation in Figure 21(top left) shows the value increasing at the same time when the door is detected. That shows the accuracy of the robot to detect the obstacles and the ability to avoid the door. Figure 19(top left) also shows evidence of the avoidance ability of the robot at the end of the positions. There are few deviations in the path of wall following is due to the sensor noise that affects the localization and classification errors that happened in the primary computing device.  In the second scenario, we conducted a similar trial with the HSR robot. Figure 19(top right) shows the localized position on the map, which is very near and mostly linear to the wall surface. Series of image frames with the timestamp is shown in Figure 22(top right) in which we can also witness the detection of PVC pipe at the time 0.20 min. Since the pipe is protruded out from a small distance, the D L for avoidance was increased in order to cross the obstacle. The graph shows the D L value with respect to the time of operation is shown in Figure 21(top right). From the above result, it has been proved that wall following and avoidance of the robot with the proposed system in scenario 2.
Likewise, in scenario 3, the robot pursued the same procedure to execute the wall following task. In this trial, the robot is avoiding a shoe rack which is shown in one of the series of Figure 23 that is embedded on the wall side. Figure 19(bottom) shows the linear path with respect to the wall surface and gets diverted when it is near to the shoe rack. With regards to the D L value how the value has been changed with respect to the time is shown in Figure 21(bottom). From the results, it is clear that the proposed scheme could significantly improve the wall following efficiency with the HSR robot and can be extended to a similar mobile robot for various applications. Also, from the results, it is clear that the proposed methodology can definitely be applied for the wall cleaning task. Through this study, autonomous vision-based wall following system has been demonstrated in real-time that inaugurates a significant untapped research and development opportunity related to wall applications such as cleaning, painting, crack inspection, and so on.

Conclusions
In this article, we proposed a novel vision-based wall following framework that acts as an add-on for any professional robotic platform to perform wall only cleaning. The proposed scheme was constructed with two separate CNN networks, first is FCN8 which holds the liability of floor/wall segmentation process, and the second network is SSD mobile network that detects the obstacles embedded on the segmented wall surface and obstacle near by wall. We introduced visual servoing technique that uses the features extracted in the CNN layer to generate the motor commands that enables the robot to perform wall following task. Also, we explained the system architecture of the HSR robot which we are using as the experimental platform. We conducted experiments on three different scenarios to evaluate the proposed scheme with the HSR robot. The CNN networks are trained with various relevant data sets before we conduct the experiments. As per CNN concern, the system can significantly perform higher wall/floor segmentation rate and obstacle detection rate. The wall following results are shown as path tracings on the generated 2D map. The generated path tracing in all the experiments is mostly parallel to the wall which means the robot can successfully follow the wall surface in all considered scenarios. With respect to the obstacle avoidance, the path track shows the deviation from the wall surface at particular time instance from where the obstacle is detected.
Overall the system shows significant performance in terms of wall-following tasks. Through this study, autonomous vision-based wall following system has been demonstrated in real-time that inaugurates a significant untapped research and development opportunity related to wall applications such as cleaning, painting, crack inspection, and so on.
There are few deviations in performance is due to the sensor noise that affects the localization and classification errors that happened in the primary computing device which will be eradicate through researching on following topics in the future works.