The Deep Convolutional Neural Network Role in the Autonomous Navigation of Mobile Robots (SROBO)

The ability to navigate unstructured environments is an essential task for intelligent systems. Autonomous navigation by ground vehicles requires developing an internal representation of space, trained by recognizable landmarks, robust visual processing, computer vision and image processing. A mobile robot needs a platform enabling it to operate in an environment autonomously, recognize objects, and avoid obstacles in its path. In this study, an open-source ground robot called SROBO was designed to accurately identify its position and navigate certain areas using a deep convolutional neural network and transfer learning. The framework uses an RGB-D MYNTEYE camera, a 2D laser scanner and inertial measurement units (IMU) operating through an embedded system capable of deep learning. The real-time decision-making process and experiments were conducted while the onboard signal processing and image capturing system enabled continuous information analysis. State-of-the-art Real-Time Graph-Based SLAM (RTAB-Map) was adopted to create a map of indoor environments while benefiting from deep convolutional neural network (Deep-CNN) capability. Incorporating the Deep-CNN improved the performance quality of the RTAB-Map SLAM. The proposed setting equipped the robot with more insight into its surroundings. The robustness of SROBO increased by 35% using the proposed system compared to the conventional RTAB-Map SLAM.


Introduction
A robot's autonomous navigation is a complicated task that relies on composing a representation of an area, grounded by robust visual processing and computer vision [1]. Robots need to simultaneously maintain a steady self-localization and a representation of their objectives. Autonomous ground robots have various applications, e.g., surveillance, healthcare, and transport in care facilities. With enhanced development, they have the potential to automate a vast collection of labor-intensive chores and perform well in unstructured, challenging environments with extreme pressures and temperatures.
An autonomous mobile robot constructs a robust model of the environment (mapping), locates itself on the map (localization), governs the movement from one location to another (navigation) and accomplishes assigned tasks. Cognitive robotics [2,3] is a compelling topic among the research, computer vision and robotics communities. It adopts artificial intelligence (AI) to enable agents to make decisions independently. Cognitive robotics offers more robust solutions in critical situations, e.g., heterogeneous environments, improving accuracy and integrity and potentially enhancing execution.
There has been a significant improvement in the accuracy and robustness of visual simultaneous localization and mapping (VSLAM) in the last two decades [4][5][6][7]. However,

Design and Methodology
Estimating the pose of an agent (translation and orientation) over time is the process of odometry [5,21]. An agent (e.g., robot, vehicle, and human) uses exteroceptive (external data) and proprioceptive (internal to the system) sensors to estimate its position relative to an initial location. Visual Odometry (VO) uses a stream of images acquired from a camera attached to the robot to localize it. Wheel odometry (WO) provides data on the translation and rotation of wheels by measuring the number and speed of wheel rotations.
Odometry of an inertial navigation system (INSO) is calculated by employing an inertial measurement unit (IMU) consisting of accelerometers, gyroscopes, and magnetometers. The gyroscope's angular rate measurements are integrated into a high-update-rate attitude solution, while the magnetometer provides a heading reference; a state-of-the-art Kalman filter algorithm combines them to determine the attitude, velocity and position. The attitude is calculated by integrating the angular rate estimated by the gyros over a period. The accelerometer measures the system's linear acceleration due to motion and the pseudo-acceleration caused by gravity. WO produces noisy data with high error rates due to sliding, uneven surfaces, and weight shifts. On the other hand, VO can provide precise trajectory estimates, with a relative position error of 0.1-2.0% [22], given progressive frames and scene overlap from which to extract probable movement.
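As a minimal illustration of the gyro-integration step described above, the following sketch accumulates z-axis angular-rate samples into a yaw estimate; the sample rate and values are illustrative, not SROBO's actual IMU output:

```python
# Minimal sketch of inertial odometry: integrating gyroscope angular-rate
# samples into a heading (yaw) estimate. Sample values are illustrative.

def integrate_yaw(yaw0_rad, gyro_z_rad_s, dt_s):
    """Integrate z-axis angular-rate samples over fixed time steps dt_s."""
    yaw = yaw0_rad
    for omega in gyro_z_rad_s:
        yaw += omega * dt_s  # attitude = integral of angular rate
    return yaw

# 100 samples at 100 Hz of a constant 0.5 rad/s turn
yaw = integrate_yaw(0.0, [0.5] * 100, 0.01)
```

In practice this open-loop integral drifts, which is exactly why the Kalman filter fuses it with the magnetometer heading reference.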
The fusion of proprioceptive (inertial measurement unit, encoders) and exteroceptive sensors (camera, LiDAR) can improve the robustness and accuracy of odometry estimation [23]. The map is generally represented as landmarks within a three-dimensional space, while sensors traverse the environment and generate a trajectory [24,25]. The landmarks are observed from the sensor poses, and the most feasible landmark positions and sensor poses are estimated. Information regarding the visual landscape of where the robot is navigating is as essential as creating a map for autonomous navigation and path planning.
Dense (pixel-wise) semantic segmentation was employed using the residual learning deep neural network (ResNet) framework [26]. The network's layers learn residual functions with reference to the layer inputs instead of learning unreferenced functions [27]. Dense per-pixel semantic segmentation classification was conducted by overlaying masks on the image.
The Single-Shot Multi-Box Detector (SSD) is one of the fastest object-detection algorithms [28]. A single convolutional neural network discretizes the output space of bounding boxes into a set of default boxes over different scales and aspect ratios per feature-map location. The network generates scores for each object in the scene for classification and adjusts the boxes to better approximate the object's shape. In this study, SSD-MobileNet is generated by combining the SSD network as a meta-structure with MobileNet [29]. SSD-Inception [30,31] is another object-detection algorithm used in this research, designed based on SSD with high performance and accuracy. It uses a single convolutional neural network to detect objects in an image, in which an inception block replaces the extra layers in SSD.
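The default-box idea can be sketched for a single feature-map location as follows; the scale and aspect ratios here are illustrative placeholders, not SSD's exact configuration:

```python
import math

# Sketch of SSD-style default (prior) box generation for one feature-map
# location: boxes of equal area at different aspect ratios around a center.
# Scale and aspect ratios are illustrative assumptions.

def default_boxes(cx, cy, scale, aspect_ratios):
    """Return (cx, cy, w, h) default boxes centered at (cx, cy)."""
    boxes = []
    for ar in aspect_ratios:
        w = scale * math.sqrt(ar)   # wider boxes for ar > 1
        h = scale / math.sqrt(ar)   # taller boxes for ar < 1
        boxes.append((cx, cy, w, h))
    return boxes

# Three boxes at the center of a normalized image
boxes = default_boxes(0.5, 0.5, scale=0.2, aspect_ratios=[1.0, 2.0, 0.5])
```

The detector then predicts, per default box, class scores plus small offsets that deform the box toward the object's true shape.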

Mechanical and Electronic Hardware
The primary control system in SROBO is a Jetson Nano, which processes sensor information from an RPLiDAR A1, a MYNTEYE-D camera, an IMU and two 12 V DC motors with rotary quadrature encoders. The integrated control system has 8 GB RAM and a 128-core Maxwell GPU, powered by a power bank at 5 V and 3 A. A TB67H420FTG dual-motor driver carrier handles the speed and direction of the motors via PWM pulse computation. We incorporated a PID (Proportional, Integral, Derivative) regulator to facilitate the control loop feedback mechanism, processed on a Teensy 4.1 microcontroller.
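A PID speed loop of the kind run on the Teensy 4.1 can be sketched as follows; the gains and PWM range are illustrative assumptions, not SROBO's tuned values:

```python
# Hedged sketch of a PID speed controller: the error between commanded and
# measured wheel speed is mapped to a PWM command. Gains are illustrative.

class PID:
    def __init__(self, kp, ki, kd, out_min=-255.0, out_max=255.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measured, dt):
        """Return a (clamped) PWM command from the speed error."""
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(self.out_min, min(self.out_max, out))  # clamp to PWM range

pid = PID(kp=2.0, ki=0.5, kd=0.1)
pwm = pid.step(setpoint=1.0, measured=0.8, dt=0.01)
```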
The LiDAR (2D light detection and ranging) rotates 360° clockwise and has a range of 150-12,000 mm. It collects 1450 sample points per revolution at a scan frequency of 5.5 Hz, and eight thousand sample points per second when scanning at 10 Hz. The LiDAR emits modulated infrared laser signals and receives the echoed signals from detected objects. It is powered over the 5 V USB protocol via the primary control system.
We used a MYNTEYE-D-1000 camera with a 120° FOV (field of view) and a depth map resolution of 1280 × 720 at 60 FPS (frames per second). MYNTEYE's FOV is D: 121°, H: 105° and V: 58°, and it has a six-axis IMU.
A nine-axis MEMS motion-tracking device, the MPU9250, is implemented and combined with MYNTEYE's IMU. The sensors and peripherals were calibrated and integrated to facilitate high performance.
The visual data from the MYNTEYE-D camera are blended with the inertial measurement units (IMU), wheel encoders and the RPLiDAR laser scanner through an extended Kalman filter (EKF) [32] to achieve state estimation in real time. Data are assembled into a unit with the high-frequency pose estimation from the EKF, odometry and point clouds corresponding to the camera. The frequently received information is combined to build a consistent global 3D map via RTAB-Map [33]. Robust and accurate pose estimates based on the sensor combination are transformed into a fixed coordinate frame in real time to build a global map. The calibrated sensor information is transformed to the robot's base frame using the unified robot description format (URDF) (http://wiki.ros.org/urdf; accessed on 18 May 2022) in the ROS (https://www.ros.org/; accessed on 18 May 2022) Melodic platform on Ubuntu 18.04. Inferring properties from vision is a complex task that ordinarily requires human perception [34]; this study aims to integrate automated image analysis, semantic segmentation and object-detection algorithms combined with VSLAM to provide the robot with a richer perception of the environment. The TensorRT SDK (https://developer.nvidia.com/tensorrt; accessed on 18 May 2022) was installed on the embedded device, and the NVIDIA CUDA parallel programming model was used for deep learning and inference optimization. The PyTorch framework was used for training models. The open neural network exchange (ONNX) system enabled interoperability and format exchange in the pipeline: models trained in PyTorch over a number of training steps (epochs) were converted to TensorRT.
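As a one-dimensional sketch of the EKF fusion idea, the step below predicts position from a wheel-odometry displacement and corrects it with a visual measurement; the noise variances are illustrative, not SROBO's calibrated covariances:

```python
# Minimal 1D Kalman-filter step sketching the predict/update fusion cycle:
# predict position from odometry, correct it with a visual measurement.
# Noise values q and r are illustrative assumptions.

def kf_step(x, P, u, z, q=0.02, r=0.01):
    """One predict/update cycle.
    x, P : state (position) and its variance
    u    : odometry displacement since the last step
    z    : visual position measurement
    q, r : process and measurement noise variances
    """
    # Predict with odometry (uncertainty grows)
    x_pred = x + u
    P_pred = P + q
    # Update with the camera measurement (uncertainty shrinks)
    K = P_pred / (P_pred + r)          # Kalman gain
    x_new = x_pred + K * (z - x_pred)
    P_new = (1.0 - K) * P_pred
    return x_new, P_new

x, P = kf_step(x=0.0, P=0.05, u=1.0, z=1.1)
```

The corrected state lies between the odometry prediction and the measurement, weighted by their relative uncertainties; the full EKF does the same for the 3D pose with linearized models.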
SROBO has 3 DOF and is controlled across the x, y, and yaw axes. Communication through the I2C (Inter-Integrated Circuit) and general-purpose input-output (GPIO) protocols enables robust and consistent communication between the system and subsystems. SROBO's kinematics are resolved as it receives position and speed information through the respective algorithm and determines the PWM references of the motors.
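For a differential-drive base of this kind, the mapping from body velocity commands to per-wheel speeds (which the motor controller then converts to PWM references) can be sketched as follows; the wheel radius and track width are assumed values, not SROBO's measured geometry:

```python
# Illustrative differential-drive inverse kinematics: body velocity commands
# (v, yaw rate) to per-wheel angular speeds. Dimensions are assumptions.

WHEEL_RADIUS = 0.03   # m (assumed)
TRACK_WIDTH = 0.18    # m, distance between the wheels (assumed)

def wheel_speeds(v, omega):
    """Return (left, right) wheel angular speeds in rad/s for a body
    linear speed v (m/s) and yaw rate omega (rad/s)."""
    v_left = v - omega * TRACK_WIDTH / 2.0
    v_right = v + omega * TRACK_WIDTH / 2.0
    return v_left / WHEEL_RADIUS, v_right / WHEEL_RADIUS

left, right = wheel_speeds(0.3, 0.0)   # straight line: both wheels equal
l_turn, r_turn = wheel_speeds(0.0, 1.0)  # spin in place: opposite signs
```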
Autodesk Inventor 2022 was used to create the 3D model of the ground robot's body parts, designed for 3D printing and laser cutting. The mechanical structure is a modular assembly, where body parts can be manufactured and upgraded individually, allowing easy construction and replacement of parts as required. Figure 1 illustrates the mechanical parts created in Inventor. Table 1 lists the hardware specifications and prices for the components used in devising SROBO. SROBO's wheels are taken from a mobile car made by Makeblock (https://www.makeblock.com/; accessed on 18 May 2022). Figure 2 displays the electronic parts involved in the configuration of the ground robot.

Graph-Based SLAM
Solving a SLAM problem concerns estimating an agent's trajectory and the map of the environment. Due to the intrinsic noise in sensor measurements, a SLAM problem is usually expressed via probabilistic mechanisms. We describe the agent's movement in an unknown environment along a trajectory described by the sequence of random variables P1:S = {P1, . . . , PS}. While moving, it acquires a sequence of odometry measurements u1:S = {u1, . . . , uS} and perceptions of the environment z1:S = {z1, . . . , zS}. Solving the SLAM problem requires estimating the posterior probability of the robot's trajectory P1:S and the map M of the environment given all the measurements plus an initial position P0:

p(P1:S, M | z1:S, u1:S, P0). (1)

Estimating Equation (1) involves operating in high-dimensional state spaces, which is only manageable if a dynamic Bayesian network configuration or, in other words, the graph-based formulation is used. A Bayesian network is a graphical representation of a stochastic process as a directed graph. The graph has one node (xs) per random variable in the process, and a directed edge (zs or us) between two nodes expresses a particular constraint or dependency [35]. A single front-end observation zs might result in multiple conceivable edges connecting different poses in the graph. A graph-based mapping algorithm computes a Gaussian approximation of the posterior over the robot trajectory when the observations are affected by Gaussian noise and the data association is known. The Gaussian mean is determined as the configuration of the nodes that maximizes the likelihood of the observations; once it is found, the information matrix of the Gaussian is obtained. For more detailed information, readers are referred to [35].
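The graph-based formulation can be illustrated with a toy one-dimensional pose graph: under Gaussian noise, the maximum-likelihood node configuration is the least-squares solution of the constraint system. The constraint values below are invented for illustration:

```python
import numpy as np

# Toy 1D pose graph: three poses on a line, two odometry constraints and one
# slightly inconsistent loop-closure constraint. With Gaussian noise, the
# maximum-likelihood configuration is the least-squares solution. The first
# pose p0 is fixed at the origin, so the unknowns are (p1, p2).

# Residuals: (p1 - p0) - 1.0, (p2 - p1) - 1.0, (p2 - p0) - 1.8
A = np.array([[1.0, 0.0],    # odometry:     p1 - p0 = 1.0
              [-1.0, 1.0],   # odometry:     p2 - p1 = 1.0
              [0.0, 1.0]])   # loop closure: p2 - p0 = 1.8
b = np.array([1.0, 1.0, 1.8])

poses, *_ = np.linalg.lstsq(A, b, rcond=None)
# Optimization spreads the 0.2 loop-closure discrepancy over the whole graph.
```

This is the same mechanism, in miniature, by which a loop-closure link lets graph optimization propagate the accumulated error back through all poses.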
In a real-time graph-based SLAM approach such as RTAB-Map [33], a graph is constructed where nodes represent the robot's poses or landmarks, and sensor measurements constrain the nodes through edges connecting poses. Noisy observations are filtered by finding a configuration of the nodes that is maximally consistent with the measurements, solving a large error-minimization problem [36] while receiving information from various sensors. Loop closure is an essential step in graph-based SLAM that helps optimize the graph, generate a map and correct drift. Robust localization is needed to identify revisited locations used for loop closure and map updates.
Real-world dynamic environments cause challenges in autonomous navigation, mainly when dealing with localization on the generated map, due to the need to explore new areas and a lack of distinctive environmental features. This problem is addressed using multi-session mapping, which enables a robot to recognize its position relative to a previously generated map, plan a path to a previously visited location, and localize itself. It allows the SLAM approach to initialize a new map with its own referential and transform it into the previously created map once a previously seen area is encountered. This avoids remapping the whole environment when only a new location should be explored and incorporated. This approach is adopted in this study when dealing with semantic segmentation in images, described in the following section.
Graph-based SLAM systems consist of three parts: the front-end (graph construction), the back-end (graph optimization), and final map generation [20]. The front-end processes the sensor data to extract geometric relations, e.g., between the robot and landmarks at different points in time. The back-end constructs a graph representing these geometric relations and their uncertainties, followed by final map generation. The graph structure is optimized to obtain a maximum-likelihood explanation of the designated robot trajectory. The sensor data are then projected into a standard coordinate frame with the known trajectory.
A short-term memory (STM) module in RTAB-Map is another advantage of this SLAM. It spawns a node containing the odometry pose and the sensor data quantized into visual words, used for loop closure (LC), proximity detection (PD) and the local occupancy grid for global map assembly (GMA).
Our local occupancy grid (LOG) is generated from the camera's depth images (RGB-D), point clouds and the LiDAR. The sensor's range and FOV determine the size of a local occupancy grid. A global occupancy grid is generated by converting the LOGs into the map referential using the poses of the map's graph. LC is conducted by identifying features kept in the node and corresponding nodes in memory for validation.
Nodes are created at a fixed rate in milliseconds (ms), and a link retains a rigid transformation between two nodes. Neighbor links join consecutive nodes in the STM with odometry transformation data, and later, LC and PD links are added as detected. All the links are used as constraints for graph optimization. When a new LC or proximity link is added, graph optimization propagates the computed error to the whole graph to reduce drift [33,37].
RTAB-Map's memory is divided into two parts: working memory (WM) and long-term memory (LTM). When a node is transferred to LTM, it is no longer available to the modules inside the WM; this keeps the WM's memory load manageable. A weighting agent recognizes areas more significant than others to decide which nodes to transfer to LTM. This is accomplished using heuristics such as: the longer a location is observed, the more critical it is; therefore, it should be maintained in the WM.
When adding a new node, the STM assigns it zero weight and compares it visually (deriving a percentage of corresponding visual words) with the last node in the graph. When an LC occurs with a location in the WM, neighbor nodes can be brought back from LTM to WM for more LC and PD. As the robot moves in a previously visited area, it recognizes the previously visited locations, incrementally develops the created map further and localizes itself [37].
Input data in RTAB-Map are used to perform feature detection and matching, motion prediction, motion estimation, local bundle adjustment (LBA), and pose update [38,39]. Feature detection and matching apply the nearest-neighbor distance ratio (NNDR) test [40], the ratio of the distances to the first and second nearest neighbors of a feature descriptor. It uses a fast feature descriptor method known as binary robust independent elementary features (BRIEF) on the extracted features [37]. In motion estimation, the Perspective-n-Point iterative random sample consensus (RANSAC) method [41] is used to compute the transformation of the current frame according to the features. The resulting transformation is refined using LBA on the features of keyframes. In the pose update, the output odometry is updated using the transformation estimate, and the covariance is computed using the median absolute deviation between feature correspondences.
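The NNDR test can be sketched directly; the descriptors and ratio threshold below are illustrative, and a real system would apply the same test to BRIEF binary descriptors using Hamming distance:

```python
import numpy as np

# Sketch of the nearest-neighbor distance ratio (NNDR) test for feature
# matching: accept a match only when the best descriptor distance is clearly
# smaller than the second best. Descriptors and threshold are illustrative.

def nndr_matches(desc_a, desc_b, ratio=0.8):
    """Return (index_a, index_b) pairs that pass the ratio test."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        if best < ratio * second:        # distinctive enough -> keep match
            matches.append((i, int(order[0])))
    return matches

a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[1.0, 0.1], [5.0, 5.0], [0.2, 0.9]])
m = nndr_matches(a, b)
```

Ambiguous features, whose two nearest neighbors are at similar distances, are discarded rather than risking a wrong correspondence in the subsequent RANSAC stage.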

Deep Convolutional Neural Network (Deep-CNN)
Deep networks inherently integrate features of different levels (low to high) and classifiers in an end-to-end multilayer fashion, in which the depth, i.e., the number of stacked layers, enriches the level of the features [42]. Convolution is an operation on two functions with real-valued arguments, defined as:

s(t) = (x ∗ ω)(t) = ∫ x(a) ω(t − a) da.

Considering that x and ω are defined only on integer t, we can define the discrete convolution as:

s(t) = (x ∗ ω)(t) = Σ_a x(a) ω(t − a),

where ω(a) is a weighting function (kernel), a is the measurement index, x(a) is the input and s(t) is referred to as the feature map [43]. The time index t reflects data collected at regular intervals. In machine learning, the input is generally a multi-dimensional data tensor, and the kernel is a multi-dimensional tensor of parameters that the learning algorithm adjusts. For a 2D input I and 2D kernel K, the convolution is written as:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n),

and most CNN implementations compute the closely related cross-correlation:

S(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n).

In a CNN, each kernel member is used at every input position, which makes the operation more efficient in terms of memory requirements and statistical efficiency.
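The cross-correlation form used by CNN layers can be implemented directly for a small "valid"-mode example:

```python
import numpy as np

# Direct implementation of discrete 2D cross-correlation as used in CNN
# layers: the same kernel entries are reused at every valid input position.

def cross_correlate2d(I, K):
    """'Valid'-mode cross-correlation of input I with kernel K."""
    ih, iw = I.shape
    kh, kw = K.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

I = np.arange(16.0).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, 1.0]])  # sums each pixel with its diagonal
S = cross_correlate2d(I, K)             # feature map, shape (3, 3)
```

Sharing one small kernel across all positions is what gives the layer its memory and statistical efficiency: only kh × kw parameters are learned regardless of the input size.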
Compared to simple NNs, which involve several hidden layers, a CNN consists of many layers. This characteristic qualifies it to represent highly nonlinear functions compactly. A CNN involves many connections, and the architecture generally comprises diverse classes of layers, including convolution and pooling along with fully connected layers, and regularization.
In the first stage of a CNN layer, several convolutions are performed in parallel to create a set of linear activations. In the detector stage that follows, each linear activation is run through a nonlinear function, such as the rectified linear activation function. Pooling functions are then used to modify the output of the layer, replacing the net output at a specific location with a summary statistic of the nearby outputs.
To learn complicated features and functions that can represent high-level generalizations, CNNs require deep architectures. Deep learning methods are used to train a convolutional neural network on large-scale datasets [44]. Deep-CNNs are a specialized kind of ANN that uses convolution in place of general matrix multiplication in at least one of its layers [34,43].
Deep-CNNs consist of a large number of neurons and multiple levels of nonlinear computation. Units in the deeper layers may indirectly interact with a more significant portion of the input. This allows the network to efficiently describe complex relationships between numerous variables by composing interactions from simple building blocks that each describe only sparse relations.
In this study, the CNN comprises a small number of reasonably complex layers, each with different stages. There is a one-to-one mapping between kernel tensors and network layers. The pooling function enables making the model approximately invariant to small input translations.
Deep learning architectures tackle semantic segmentation and object detection, a high-level task toward complete scene understanding [45]. Semantic segmentation marks the objects in an image and assigns a class index. In the subsequent phase, localization or detection is performed for fine-grained inference, providing the classes and additional information regarding their spatial location. Performing scene-related navigation via trajectory and floor detection is a complicated task for embedded devices, as it demands a lot of onboard computation and power. To tackle this issue, a transfer learning method is employed [45,46]. Pre-trained inference and image-recognition backbones were used to convert the model into a Deep-CNN for per-pixel labeling, transformation and object detection.

SROBO's Autonomous Navigation Using Deep-CNN and RTAB-Map
Object-detection and semantic segmentation outputs on frame images, with confidence scores and per-pixel colored labels, are used to verify the local target map. Confidence scores higher than 85% and the brown mapped areas in Figure 3 are used by the Deep-CNN to identify the target and the areas SROBO is allowed to track to reach the target. The results from RTAB-Map with and without the 2D LiDAR scanner indicate the following frame estimation according to recommendations from the earlier structure. The confidence scores and colored maps in each frame are updated to optimize the inter-frame spatial continuity. Deep-CNN results for the current frame and estimation results from VSLAM are projected into a local map and propagated to the global target map via a pose-modification matrix between the new frame and former frames acquired from the point clouds, for more promising performance. There has been great progress in base convolutional neural network (CNN) training algorithms, including AlexNet [18], GoogLeNet [47], Inception [31], MobileNet [48] and ResNet [27]. ResNet has the highest accuracy compared to the other algorithms, at 96.4% [45]. These base CNNs are high-level training features mostly used for classification and detection with Deep-CNN architectures. Advances in the architectures resulted in more optimal and complex systems, such as network pruning [49], reinforcement learning [50], hyper-parameter optimization [51], structure connectivity [52] and optimization [53], to name a few. This makes it more challenging to implement Deep-CNNs on mobile robots due to the required power and processing time. The advancement of micro/single-board computers and embedded systems provided low-cost solutions that allowed robots to integrate Deep-CNN and data-fusion algorithms. We use semantic segmentation and object-detection algorithms to facilitate more robust navigation by SROBO.

Data Analysis and Results
The MYNT EYE camera's intrinsic parameters were calibrated to transform the depth and color images using the standard calibration tool [54]. The SROBO mobile robot is displayed in Figure 4. Figure 4a illustrates the conventional graph-based RTAB-Map SLAM generated in an indoor environment. The transformations between different sets of coordinate frames [55] were performed via the URDF model of SROBO. Sensor data are transmitted in their original frame across the network and stored in that frame. Figure 4b shows a dense 3D map of an indoor scenario with SROBO's trajectory. Semantic segmentation aims to create dense predictions extrapolating labels for each pixel, labeled with its enclosing object or area class. Figure 3a,c show semantically segmented images created using the SUN RGB-D (http://rgbd.cs.princeton.edu/challenge.html; accessed on 18 May 2022) database in real time. The scene labeling of per-pixel class segmentation is created for each image pixel using ResNet with SUN RGB-D (512 × 400) and (640 × 512) input images. The SUN RGB-D dataset, with 10,335 labeled RGB-D dense images of indoor objects, 146,617 2D polygons and 58,657 3D bounding boxes [56], was used. The deep learning was performed on SROBO's primary control system with an active power supply and a consumption of 1.25 W.
It is combined with the ResNet-18 architecture [27] at 640 × 512 and 512 × 400 for the object segmentation. It offers the possibility of learning hierarchies of feature representations for indoor scenarios. ResNet-18 has 18 weighted layers (the ResNet family scales up to 152 layers), in which each layer's information is transferred to the next layer to be used to extract further information.
The solid pixel segmentations in Figure 3a,c were created using ResNet at 640 × 512 and 512 × 400, respectively. The images correspond with the classified images of the standard PASCAL Visual Object Classes (VOC) dataset [57], shown in Figure 3b,d. Figure 3b illustrates the residual image created by subtracting ResNet SUN RGB-D (640 × 512) from ResNet Pascal VOC (320 × 320) input images of pixel-level labeling. Figure 3d shows the residual image of subtracting ResNet SUN RGB-D (512 × 400) from ResNet Pascal VOC (320 × 320). The Matlab Image Processing Toolbox was used to create the residual images and compare the performance. The location of the maximum difference is identified by a yellow star, shown in Figure 3b,d. Both classifiers showed good results for this study; however, the ResNet SUN RGB-D (640 × 512) per-pixel classifier was chosen. It was used to identify the places SROBO is allowed to enter for autonomous navigation.
For object detection in this study, we used the Model Zoo (https://modelzoo.co/; accessed on 18 May 2022) machine learning deployment platform to train our algorithm using the single-shot detector with MobileNet and Inception backbones (SSD-MobileNet and SSD-Inception). SSD-MobileNet [58] has an NN architecture tailored for mobile platforms, with a significantly decreased number of required operations and memory [17]. It has a good level of accuracy and takes low-dimensional compressed inputs. The input is initially expanded to a high dimension and filtered with a lightweight depth-wise convolution [58]. Features are subsequently projected back to a low-dimensional representation with a linear convolution. This leaves a smaller memory footprint, as large intermediate tensors are never fully materialized.
The pre-trained SSD-MobileNet-v2 model, covering 91 object classes, was retrained on SROBO's control system using 100,000 images of indoor objects, adding nine more categories to the pre-trained algorithm. The training was performed by adding five more batches and 40 training steps, which took around 192 h to complete. A total of 95,256 images were used for the retraining, of which 3098 were used for validation and 3586 for testing, as listed in Table 2. Google's Open Images (https://github.com/openimages/dataset; accessed on 18 May 2022) and the Model Zoo dataset were employed for training. Semantic segmentation and object-detection algorithms were combined with RTAB-Map SLAM to facilitate autonomous navigation by SROBO. The framework enabled the robot to perform real-time image processing, pathfinding and collision avoidance. This is conducted by identifying the objects in the room and setting targets based on the objects detected in the scene. The system uses keyframes to label objects and later imports them into the 3D point cloud for optimization. Pixel-wise maps are created via semantic segmentation based on the 2D input images from the camera.
RTAB-Map creates dense local point clouds using keyframes and the camera's depth information. It constructs the global point cloud by calculating the camera poses. Depth information is employed for precise localization. SROBO stops as soon as it detects and recognizes the target object in the scene.
Dynamic objects in the environment are identified based on the collected input images and their pixel values, comparing each image with the previous one. Object removal is conducted based on the pixel classification and boundaries using the nearest neighbor interpolation. The designed framework creates an optimized map where the SROBO can navigate autonomously. It benefits from RTAB-Map's feature-based visual odometry, loop closure detection, pose optimization and mapping in real time.
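A minimal frame-differencing step for flagging dynamic pixels might look as follows; the threshold is an assumption for this sketch, and the actual framework combines such a comparison with per-pixel classification and boundary handling:

```python
import numpy as np

# Illustrative frame-differencing step for flagging dynamic pixels: compare
# consecutive frames and mark pixels whose intensity changed beyond a
# threshold. The threshold value is an assumption for this sketch.

def dynamic_mask(prev_frame, curr_frame, threshold=25):
    """Boolean mask of pixels that changed between consecutive frames."""
    diff = np.abs(curr_frame.astype(np.int32) - prev_frame.astype(np.int32))
    return diff > threshold

prev = np.zeros((4, 4), dtype=np.uint8)
curr = prev.copy()
curr[1:3, 1:3] = 200            # a "moving object" enters these pixels
mask = dynamic_mask(prev, curr)
```

Pixels inside the mask can then be excluded from mapping or filled from their neighbors, in the spirit of the nearest-neighbor interpolation described above.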
The proposed idea was compared with the original RTAB-Map using the sensors discussed in the previous sections. Three indoor environments were used to validate the framework using the abovementioned algorithms. Table 3 compares the algorithms' mean average precision (Mean-AP) accuracy during real-time processing before and after implementing RTAB-Map. The comparison was conducted using the proposed system (a fusion of semantic segmentation, object detection and RTAB-Map) with and without the LiDAR sensor, regarding relative pose error (RPE), root mean square error (RMSE) and robustness. Three indoor scenarios were investigated: a static scene (S1, no moving objects), a dynamic scene (S2, one person moving) and a multi-dynamic scene (S3, four people moving) in a house with two corridors, three rooms, a kitchen and household items. Figure 5a represents the 2D map of the environment in which we conducted the experimental analysis (S1, S2 and S3). It shows the indoor layout and SROBO's two trajectory paths in scenario S1, with the same start and end point. Figure 5b displays the 3D map of the front room with point clouds and the path SROBO was following. Figure 5c,d show the objects SROBO detected in the room, with confidence scores of 82.2%, 88.4% and 75.8%. Figure 5e depicts the trajectory of SROBO in the S2 scenario, in which the test was conducted twice sequentially. It shows SROBO's path trajectory and the path it follows in detecting the specified pixels following semantic segmentation. It noticed the first object (the couch) with 88.4% confidence. The figure also shows the frame in which the third couch had a confidence score of 75.8%, lower than the accepted confidence threshold (85%). Thus, SROBO resumes tracking its path (pixels) until it reaches the accepted confidence score for the detected objects in the scene and stops where it began.
Figure 5f,g illustrate the 3D map and the trajectory of SROBO in the third scenario (S3). Figure 5f shows the proposed system while using the LiDAR sensor.
The RPE determines the local accuracy of the trajectory over a fixed time interval, corresponding to the drift of the robot's trajectory from a sequence of camera poses. The RMSE is calculated over all time indices of the translational component of the RPE to evaluate the drift per frame. The conventional RTAB-Map's and the proposed system's RMSE, mean and RPE values are listed in Table 4. The results showed that the proposed approach is 35% more robust (less average error) than the conventional RTAB-Map in all three scenarios. The RMSE values for the traditional RTAB-Map and the proposed system in the S3 scenario using LiDAR were 13% and 9%, and without LiDAR, 15% and 10%, respectively. The results collected from SROBO while navigating autonomously illustrated that the robot could safely navigate the environment, avoiding dynamic and static obstacles, and reach the targets set for it. However, it occasionally faced system disconnection problems due to the computational power required for real-time image processing. Another issue was that the MYNTEYE camera requires high power to function.
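For one-dimensional trajectories, the per-frame translational RPE and its RMSE can be computed as in the sketch below; the trajectories are invented for illustration:

```python
import numpy as np

# Sketch of the translational relative pose error (RPE) and its RMSE for 1D
# trajectories: compare the estimated per-frame motion against ground truth
# and take the root mean square of the differences. Data are illustrative.

def rpe_rmse(gt, est):
    """RMSE of per-step translational drift between two trajectories."""
    gt_steps = np.diff(np.asarray(gt, dtype=float))    # true per-frame motion
    est_steps = np.diff(np.asarray(est, dtype=float))  # estimated motion
    err = est_steps - gt_steps                         # per-frame drift
    return float(np.sqrt(np.mean(err ** 2)))

rmse = rpe_rmse(gt=[0.0, 1.0, 2.0, 3.0], est=[0.0, 1.1, 1.9, 3.0])
```

The full metric uses relative SE(3) transforms between pose pairs instead of scalar differences, but the aggregation into an RMSE is the same.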

Conclusions
Our results demonstrated that adapting contemporary classification networks into fully convolutional networks was practical, transferring the learned representations to the segmentation task by fine-tuning. A skip architecture merges semantic information from a deep, coarse layer with appearance data from a shallow, fine layer to produce accurate and detailed segmentations [59]. This study developed a mapping system that enables SROBO to recognize the locations it is permitted to enter. It detects objects and follows the trail of the specified entities in the scene. It plans routes to a previously visited site, localizes itself on the map, and stops as soon as it detects a target object confined to the location. It allows the SLAM to initialize a new map with its own referential and transform it into the previously created map once a previously seen area is encountered, avoiding remapping the whole environment when only a new area should be explored and incorporated. This approach was applied when dealing with semantic segmentation in images. Furthermore, the position estimation error could be reduced by around 68% when combining vision-based position estimates with other sensor fusions [60]. RTAB-Map's localization and mapping improved by incorporating the Deep-CNN, which furnished the robot with more awareness of the environment in which it functioned. The results showed that SROBO gained more insight into the environment to which it is exposed. It uses semantic segmentation to recognize the path it is allowed to track and follows the set constraints. Retraining the object-detection algorithm improved its recognition capability by 62%.
The experimental comparisons regarding pose accuracy and swiftness were conducted using the RMSE. The results showed that blending RTAB-Map with the Deep-CNN improved performance by 20% to 48% compared to the conventional RTAB-Map. SROBO's target detection gradually improved as the algorithm learned and noticed the target object in its frame. The RMSE value increased in the S3 scenario compared to S1 due to the presence of dynamic obstacles. However, there was no significant change in RPE values across the different scenarios. During the experimental study, we faced some issues related to the robot's wheels: friction affected the robot when turning to its sides, where the wheels came out of the rubber tracks. In a future study, we will replace the wheels with an alternative solution for better performance. Future studies will also assess the robot's motion performance in a closed-loop setting using a VICON motion capture system to further validate the proposed framework.