A CNN-Based System for Mobile Robot Navigation in Indoor Environments via Visual Localization with a Small Dataset

: Deep learning has made great advances in the ﬁeld of image processing, which allows automotive devices to be more widely used in humans’ daily lives than ever before. Nowadays, the mobile robot navigation system is among the hottest topics that researchers are trying to develop by adopting deep learning methods. In this paper, we present a system that allows the mobile robot to localize and navigate autonomously in the accessible areas of an indoor environment. The proposed system exploits the Convolutional Neural Network (CNN) model’s advantage to extract data feature maps for image classiﬁcation and visual localization, which attempts to precisely determine the location region of the mobile robot focusing on the topological maps of the real environment. The system attempts to precisely determine the location region of the mobile robot by integrating the CNN model and topological map of the robot workspace. A dataset with small numbers of images is acquired from the MYNT EYE camera. Furthermore, we introduce a new loss function to tackle the bounded generalization capability of the CNN model in small datasets. The proposed loss function not only considers the probability of the input data when it is allocated to its true class but also considers the probability of allocating the input data to other classes rather than its actual class. We investigate the capability of the proposed system by evaluating the empirical studies based on provided datasets. The results illustrate that the proposed system outperforms other state-of-the-art techniques in terms of accuracy and generalization capability.


Introduction
Nowadays, the growing demand for servant robots has prompted researchers to improve technologies for automating these platforms. One of the most fundamental technologies for automating these platforms is their navigation system [1]. Autonomous mobile robot navigation is an active area of research and has gained the great attention of numerous researchers in the recent past [2,3]. Autonomous Mobile Robots (AMR) are designed and programmed to perform multiple tasks in substitution for humans. Humans are usually unable to perform such tasks due to multiple reasons such as bad environmental conditions or unhealthy and dangerous environments, or even repetitive tasks that can be boring. Due to these conditions and the need for robots in strenuous circumstances, the recent past has seen tremendous research in mobile robots. The intention is to have such devices, which could automatically perform most tasks without human intervention [4]. All the advantages, which can be achieved with AMR usage, also come along with numerous challenges. Localization and navigation are the main challenges in the self-controlling system of these types of robots.
Recent studies have made use of the Global Positioning System (GPS) to address the challenges of location and navigation. A number of studies have been built on the concept of GPS and are very successful in some applications [5]. Although the use of GPS proved to be successful in a number of applications and provides very high accuracy and

Related Work
In recent years, many models have been proposed that use these layers to perform the classification task. One of the pioneer models is the AlexNet model [14], which proposed Rectified Linear Unit (ReLU) [15] to improve accuracy and performance. The advantage of ReLU is that it rapidly speeds up the convergence process during the training of the network. Later, the VGG model [16] introduced factorized convolutions, which led to improved CNN-based models. VGG uses smaller receptive windows such as kernels or filters in the convolutional layers, which reduces the network's size and enhances performance. However, the parameter size is large, and it can not prove well for embedded systems. Google researchers released a version of their architecture called MobileNet [17]. MobileNet was designed especially for the mobile or embedded architecture-based systems, which have higher hardware constraints. This model reduced the training time and improved the performance. In addition, MobileNet comprised 28 layers in total and achieved an accuracy of 87 percent on the ImageNet dataset.
Junior et al. [18] proposed an online IoT system for mobile robot localization based on machine learning methods. They first collected a dataset by GoPro, and the Omnidirectional camera performed the localization task based on the image classification. They used the advantage of convolutional neural networks (CNN) to extract the features from images and then categorized them by a machine learning classifier algorithm. Although their model achieved good accuracy, it is not performing in an end-to-end fashion. Foroughi et al. [19] introduced a robot localization model in hand-drawn maps with the help of CNN and the Monte Carlo Method (MCL). Their localization model is in two phases. In the first phase, they extracted data through the different locations of the given map and trained these data on the designed CNN network to predict the robot's position. In the second phase, the robot's actual position on the map was then evaluated more accurately using the MCL model. The performance of their proposed model is highly influenced by the quality of the hand-drawn map. Zhang et al. [20] proposed a model named Posenet++ for the robot re-localization problem. Their model is a developed version of the VGG16 network that regresses the camera's position using the input image. Although they used weight from pre-trained models, their model still suffers from the accuracy bounded due to the limited size of the dataset. Radwan et a1. [21] introduced VLocNet++ to estimate the camera's position by utilizing a multitask learning method based on the CNN network. Their model integrated the semantic and geometric information of the surrounded into the pose regression networks. Nevertheless, their model may fail in an environment with similar features due to the limited generalizability. Hou et al. [22] proposed a point cloud map based on the ORB-SLAM algorithm for robot localization and 3D dense mapping in an Indoor Environment. Although their model performed well for the task of 3D mapping, even in a large-scale scene, the effect of the proposed model for the process of robot positioning is poor. Lin et al. [23] introduced a robust loop closure approach to eliminate the cumulative error caused by the long-term localization of mobile robots. Firstly, they use an instance segmentation network to detect objects in the environment and further realize the construction of a semantic map. Then, the environment is represented as a semantic graph with topological information, and the place recognition is realized by efficient graph matching. Finally, the drift error is corrected by object alignment. Although the proposed method has high robustness, it is not suitable for scenes with few objects. Using CNN and Q-Learning, Saravanan et al. [24] make robots autonomously search and reach plants in an indoor environment to monitor the health of the plant. They build an IoT enabled autonomous mobile robot using CNN and reinforcement learning. Ran et al. [25] design a shallow CNN with higher scene classification accuracy and efficiency to process images captured by a monocular camera. In addition, the adaptive weighted control algorithm combined with regular control are used to improve the robot's motion performance. However, its capability of self-learning and continuous decision-making is still not satisfactory.

Problem Statement
In this paper, we focused on addressing the coarse localization of mobile robots in an indoor environment. To do this, we made a scenario to deal with the localization and navigation problem of the home robot by using a CNN-based model as place recognition to identify the candidate places where the robot is possibly located.

Hardware
To do this task, a Pioneer 3DX is equipped with a MYNT EYE camera. The robot, along with its equipment, is shown in Figure 1. Table 1 represents the specifications of the sensor that is used for the proposed approach. Pioneer 3DX is a small lightweight two-wheel two-motor differential drive robot ideal for indoor laboratory or classroom use. The robot comes complete with front SONAR, one battery, wheel encoders, a microcontroller with ARCOS firmware, and the Pioneer SDK advanced mobile robotics software development package. The MYNT EYE camera is equipped with leading hardware solutions such as IR active light, IMU six-axis, hardware-level frame synchronization, global shutter, etc., up to 720 p/60 fps of synchronous image information.

Environmental Space for Robot Navigation
An apartment is chosen as the environmental space for robot navigation. This type of environment is selected because it has characteristics that contribute to the recognition and navigation of the robot. Figure 2b presents the 2D floor plan of the apartment. The dark space (black and gray) represents obstacles or occupied areas, and the white space represents the free space. The robot cannot move in the dark space and can only take action in the free space. In addition, since the robot has the freedom to move all over the free space of the working space, the application of the controlling theory will be complicated and even unnecessary. Therefore, we limited the robot's movements by allowing the robot to take actions only in the special areas of the apartment. Figure 2c shows the topological map of the apartment that some areas, such as yellow areas, are chosen as the robot workstation.
Each workstation is labeled into the four classes "Right, Up, Left, and Down (0 • , 90 • , 180 • , and 270 • )" on the topological map of the apartment. As mentioned, to make control theory easier, we limited the movement of the robot by following rules as follows: • The robot must have access to all the selected workstations in the apartment, but it does not need to have access to all over the free space of the apartment. • The robot is only allowed to move between the workstations (W1 to W8) in the straight direction of 0, 90, 180, and 270 degrees. Therefore, these areas must be chosen in a way that there should be no dynamic obstacle between them. As an example, if the robot is placed at any point on the W3, it can move by taking the 180-degree straight path to any point on the W8. • The robot is allowed to turn to the specific direction only inside the workstations, not anywhere else out of these areas.
As shown in Figure 2c, eight areas are chosen as the robot workstation for this apartment. The robot has access to all the rooms of the apartment such as kitchen, bedrooms, living rooms but not to all the free spaces of the mentioned rooms. There is at least one workstation for each room, but the rules mentioned above might not be enough. For example, W6, W7, and W8 are assigned to the living room since having only one unique workstation in this unit will not fit the requirement, as mentioned earlier. For example, if the robot is placed in the W7 area, it cannot go to the W4 by taking the straight path; therefore, W6 is born. This situation usually happens when a room has a border with several other rooms. As mentioned, each workstation is labeled into four classes; therefore, there are 32 classes for the apartment used in this study, including W1-Right, W1-Up, to W8-Down. Furthermore, a set of paths is designed as the robot pathway to move between workstations. Some commands for the robot navigation are present in Table 2.

Data Acquisition
By utilizing the MYNT EYE camera, a dataset is created. Using MYNT EYE, we captured RGB images with an angle of 105°height and 58°. Images were captured in positions equivalent to the classes W1-Right to W8-Down. The dataset consisted of 534 Images, with a different number of images per class (due to the different size of each area) with a resolution of 798 × 448 pixels. Figure 3 and Table 3 present the sample images of the dataset and the number of images for each class, respectively.

Methodology
In this section, the proposed approach for mobile robots' localization and navigation is expressed in detail. The proposed system consists of computer vision processing, which incorporates a topological mapping method and CNN models. In the designed system, the user submits a request to send the robot to the desired destination. The system first creates a route from the robot's current position to the destination. The route is then broken into the commands and runs line by line individually. To execute each command, an image request is first sent to the sensor (camera), which is installed on the robot. The sensor takes an image of the environment in which it is located and sends it to the CNN unit. There are two CNN-based models in the CNN unit. The first model is responsible for estimating the position of the image received from the sensor. In fact, the model that is trained on the dataset beforehand assigns the appropriate class to the image. At this stage, if the location predicted by the first model is the destination desired by the user, the system will send a message to the user stating that the mission has been completed and will wait for the next command. Otherwise, the second model is executed, and, if there is a doorway in the same image, the model detects whether it is closed or open. Closed door means that the robot does not have access to the next destination, in which case an appropriate message will be delivered to the user. However, if there are no closed doors and no obstacles in the pathway, the command is eventually executed, and the robot takes action, which can include turning at a certain angle or moving forward. The flowchart of the proposed approach is presented in Figure 4. The prepared dataset for this study has 32 classes and 534 images. Dealing with such a small dataset with this number of classes for classification requires specific consideration. Deep learning-based models require large amounts of data to show their effectiveness; therefore, the majority of well-known classification models are designed to perform on the dataset with a large amount of data, such as ImageNet [26], which contains millions of images and thousands of classes. Despite the advances made in deep learning in recent years, there are not many standard methods to design a network to fit private and small datasets. Therefore, there are only a few basic points to consider when designing a network to fit the small datasets.
Overfitting is one of the common problems that can happen when the model suffers from a lack of data. Overfitting occurs when a network performs extremely well on the train data but performs poorly on the unseen data. To prevent overfitting on the small datasets, not training the data on the complex model must be considered. The very deep convolutional networks with a huge number of parameters may work well on large datasets, but they are not flexible when exposed to small datasets. Lack of generalization is another major problem, which can happen due to the lack of data. The data augmentation technique is one of the most effective techniques to address this problem. This technique can increase the amount of the data by applying different transformations to the training set, such as flip, rotate, crop, and many others.
Considering the above explanations, we proposed a simple CNN-based model to classify the prepared dataset into one of 32 categories. The classifier model is trained on the prepared datasets for the classification task using the architecture presented in Figure 5. The input of the proposed classifier model is an RGB map image with a fixed size of 256 × 256, and it is consists of four convolution layers, four max-pooling layers, and two fully-connected layers. The convolution layers are equipped with the ReLU activation function. The number of output filters in the first, second, third, and fourth convolution layer is 8, 16, 32, and 64, respectively, and two first convolution layers have kernel sizes of 5 × 5 with stride 5 × 5, and the others have sizes of 3 × 3 with stride 3 × 3. After each convolution layer, there is a max-pooling layer with a pool size of 2 × 2 and stride 2. After that, there are two fully-connected layers, where the first one has 512 channels and the second one has 128 channels and is equipped with the ReLU activation function. Finally, the network ends with the softmax [27] layer. The configuration of the proposed classifier model is outlined in Table 4.

Controller Model
The proposed controller model is a CNN-based architecture, which detects the presence of possible obstacles in the pathway of the robot by analyzing the image taken from the sensor. These potential obstacles include closed or semi-open doors that prevent the robot from reaching the desired location. The robot can only move from one workstation to another if there is no door between them or be fully open. All the images in the prepared dataset have one of these four modes, including without door, with the fully-open door, with the semi-open door, with the fully closed door ( Figure 6). We have not created a new dataset to detect the door's situation in the image and only adjust the prepared datasets according to the needs of the controller model. Therefore, the prepared dataset is classified into two groups as free space and occupied. The free space group contains images with no or fully-open doors, while the occupied group contains images with semi-closed or fully-closed doors. The adjusted dataset for training on the doorway controller model has 534 images, of which 282 images belong to the free space class, and the rest belongs to the occupied class. Considering the above explanations, we proposed a CNN-based network as the controller model, classifying the prepared dataset into one of two categories. The architecture of the proposed controller model is exactly the same as the proposed classifier model ( Figure 5), except that the softmax layer has two outputs.

Data Normalization
In deep learning, the quality of a model is largely affected primarily by training data [28], but there are some techniques to optimize the performance of the model. One of the effective and very popular techniques is normalization. In deep learning-based models, any changes in earlier layers distribution will be enhanced in the next layers; therefore, each layer tries to adjust to the new distribution. This problem, called Internal Covariate Shift, reduces the speed of the training process and causes the network to take more time to reach the global minimum. Normalization techniques are the standard method to overcome this problem. Besides reducing the Internal Covariate Shift, normalization standardizes the inputs, smooths the objective function, stabilizes the learning process, and decreases the training time. Data normalization is performed to improve the training process of the deep neural network and is usually performed as a preprocessing step before the actual execution of the training process. In data normalization, the value of each feature is mapped to the [0,1] range, where 1 represents the maximum value and 0 represents the minimum value for each feature. Most of the systems have features, which have values, where the difference between the maximum and minimum values has a very large scope. Therefore, it is necessary to apply the logarithmic scaling method for scaling to obtain the values in suitable ranges. A logarithmic scale displays numerical data over a very wide range of values in a compact way.
Batch normalization [29] and layer normalization [30] are the two most well-known normalization techniques. In order to investigate the effectiveness of the mentioned methods, we modified and trained both proposed models once by adding a batch normalization layer after each convolution and once by adding layer normalization after each convolution layer. The results are illustrated in Figure 7.
As it can be seen in Figure 7, batch normalization did not improve the performance of the proposed classifier and controller models, which can be because of the size of the dataset. Batch normalization depends on the size of the mini-batch [31] so that the small mini-batch size makes the training process noisy. As the size of the prepared dataset is small, the size of the mini-batch is therefore small as well, and, as a result, batch normalization could not play an effective role. Figure 7 also shows that layer normalization effectively increased the accuracy of both training and validation sets. In addition, the proposed models with layer normalization decreased the noise of the training and validation curves in comparison to the other two models, which makes it more reliable.

Cost Function
The goal of the neural network during the training process is to reduce the loss function. This means reducing the difference between computed and targeted values. Cross entropy (CE) loss is one of the most well-known and popular loss functions for classification. In a C − class classification task, the training dataset with a batch size of m can be defined as , and y i = {1, . . . , C} is the true label of x i . The network's probability when label x i is j = {1, . . . , C} can be defined as Equation (1): where z i, f is the network's prediction. The formal definition of the CE loss for C − class can be computed as follows: When x i is labeled as y i (y i = j), then 1(.) = 1; otherwise, for all j = y i , 1(.) = 0. Based on Equation (2), the CE loss only considers the probability of the input data when it is allocated to its true class. As mentioned earlier, the number of images of the prepared dataset is quite small, and each class contains a few images. Therefore, in order to increase the performance of the proposed network, we modified the CE loss so that the probability of allocating the input data to other classes rather than its actual class is also considered. Equation (3) represents the loss when the data are allocated to other classes instead of its true class: When x i is not labeled as y i (y i = j), then 1(.) = 0; otherwise, for all j = y i , the 1(.) = 0. In addition, α is a variable for controlling the scale of the loss when the data are allocated to other classes instead of its actual class (α ≥ 0). By integrating Equations (2) and (3), the proposed loss function (L task ) can be obtained, which not only considers the probability when the data are labeled truly but also consider the probability when the data are labeled wrongly: Besides increasing the probability when the data are labeled truly, at the same time, the proposed loss decreases the probability when the data are allocated to other classes instead of its true class.

Weight Loss
As presented in Table 3, the prepared dataset for the classifier model involves multiple labels for various locations of the apartment (the same also holds true for the dataset prepared for the controller model). Since the number of images varies for different classes, and, in some classes, the imbalances are higher, which is why the dataset needs to be balanced within each class. As a result, we designed the weight loss plan in order to correct the data imbalances. The weight (ω C ) is defined as follows: where n max is the number of images of the biggest class in the dataset, n c represents the current classes' image number, ∑ C j=1 n j indicates the total number of images of the dataset, and β is a variable for controlling the scale of the weight loss. Using the presented weight, the classes with smaller data sizes attain the weight with a larger value, and the classes with bigger data sizes attain the weight with a smaller value. By applying Equation (5) into Equation (4), the weighted loss function is defined as follows: To evaluate the effectiveness of the proposed loss, we trained both of the proposed classifier and controller models once by using CE loss once by adopting the proposed loss. The results are presented in Figure 8. According to Figure 8, the proposed loss slightly improves the accuracy of both proposed classifier and controller models. In addition, compared to the CE, the proposed loss decreased the noise of the training and validation curves, made them smoother, and converged faster to maximize accuracy:

Empirical Results
In this section, the results of the proposed classifier and controller models, along with the proposed loss, are presented. First, evaluation parameters are described, and then the experimental setup, which is used for the execution and implementation of the proposed system, is detailed. Finally, the performance of the proposed classifier and controller models is compared with state-of-the-art models.

Evaluation Parameters
To evaluate the performance of the proposed classifier and controller models, the standard metrics are used, including accuracy and F1-Score. Let us define TP as True Positive, TN as True Negative, FP as False Positive, and FN as False Negative. Mathematically, these evaluation metrics are defined as Equations (7) and (8): Accuracy is one of the most standard metrics to evaluate the performance of the CNNbased classification models. Accuracy is a metric of how close a measured value is to its standard or reference value. It means that the measurements for a given object are relative to the known value, but the measurements are far from each other, meaning that the accuracy is without precision. The F1-Score, another standard metric for evaluating the classification problem, is a weighted mean of precision and recall. Precision (TP/(TP + FP)) is the quality of being exact. It refers to how two or more measurements are close to each other, regardless of whether those measurements are accurate or not. Recall (TP/(TP + FN)), also known as sensitivity, measures the ratio of the relevant results predicted correctly by the model. In the case where both precision and recall are influential, the F1-Score can be used.

Training Strategy and Parameter Tuning
In this step, the effects of different parameters on the proposed classifier and controller models are investigated. These parameters are including, input image size, training testing split ratio, and elements of proposed loss. Whenever an experiment is conducted, one parameter is replaced while the other parameters stay unchanged. In addition, each experiment is taking in 10-fold cross-validation (run ten times) method on both prepared datasets.
The tools and technologies employed for implementing and developing the proposed system primarily consist of Python developmental language V3.5. For deep learning model creation, the Keras framework based on Tensorflow 2.1 library is used. All the training models and procedures are executed on NVIDIA GeForce GTX 1080TI with 8 GB memory. The Stochastic Gradient Descent (SGD) [32] with a momentum of 0.9 is adopted as the optimization. The SGD is an efficient algorithm developed to reduce the loss function in the quickest way possible. The initial learning rate is set to the 3 × 10 −5 reducing mechanism when validation loss stopped improving. The dataset is divided into a number of mini-batches during the training process. The batch size is set to 16, which means that the input data are divided into 16 batches and processed each batch in parallel. To prevent the possible overfitting, an early-stop strategy is used to monitor the validation loss improvement.
In addition, to make data more robust, an image data generator feature (data augmentation) is used. Data augmentation is a powerful feature and lets the data be enhanced in real-time while the deep learning model is still training or in the execution phase. Using the Keras library, there are a number of options, such as contrast random normalization and geometric transformations, which can be selected to perform the different transformations on data. Any transformation can be applied to any image present in the dataset while it is transferred to the model during training. Utilizing this feature offers two major benefits: saving time and memory overhead during the training phase and making the model more robust. Table 5 presents the parameters and their values, which are selected for the train data generator. The classification performance of the proposed classifier and controller models is compared by given different input image sizes, including 128 × 128, 228 × 128, 256 × 256, and 465 × 256. Both models are trained on prepared datasets with these sizes, and results on the test set are illustrated in Table 6.

Training Testing Split Ratio
The training-test split method is a technique to evaluate the performance of deep learning models for prediction on the data that has never been used during the training process. In order to achieve the best classification performance of the proposed models, the dataset is divided into five different training and testing ratios, including 90:10, 80:20, 70:30, 40:60, and 50:50. The X:Y split ratio means the X% of the data are used for training of the deep learning model while the rest of the Y% data are used to test the model. As each prepared dataset has a total number of 534 images when split into 90:10 ratio, 481 images are selected for training, and 53 images are selected for testing the model. Both proposed classifier and controller models are trained on the prepared datasets with the mentioned split ratio, and the results on the test set are presented in Table 7. As presented in Table 7, the models, when trained with split ratios of 90:10 and 80:20 obtain almost similar results, which is better than when the model trained with split ratios of 70:30, 60:40, and 50:50. However, both proposed models attained the best classification results when trained with a split ratio of 80:20 by achieving an accuracy of 96.2% and 98.6% for classifier and controller models, respectively.

Effect of α and β of the Proposed Loss Function
As expressed in the methodology section, the parameters α and β are defined to regularize the proposed loss function. The α is defined to control the scale of loss and β for controlling the scale of class weight. In order to evaluate the effect of α and β on the performance of the proposed loss, different values are given to these two variables. Furthermore, the proposed models are trained on both prepared datasets with the various values of α = {0, . . . , 5} and β = {1, . . . , 5}, and results on the test set are presented in Figure 9. When α = 0, the proposed loss will be the same as weighted CE loss, which only considers the probability of the input data when it is allocated to its true class. The proposed loss also considers the probability of allocating the input data to other classes rather than its actual class. Therefore, when the α has a smaller value, the loss focuses more on the data that the model correctly predicts their labels, and when the α has a bigger value, the second part of the loss function (which is responsible when the label of the data are not correctly predicted) becoming the main part. The best value for α and β can be achieved empirically and must be adjusted for each dataset separately. On the prepared datasets, both proposed models show their best performance when α and β are set to 2 and 5, respectively.

Comparison of Proposed Loss and Cross-Entropy loss
The performance of the proposed loss is further compared to the CE loss. As mentioned earlier, the proposed loss is an extended version of the CE loss. The CE loss only considers the probability of the input data when it is allocated to its true class, but the proposed loss, in addition to the advantages of the CE, also considers the probability of allocating the input data to other classes rather than its true class. For comparison, the proposed classifier and controller models are trained in 10-fold cross-validation (run ten times) method once with employing the proposed loss and once with CE loss. In order to have a fair comparison, in both experiences, all the parameters and training strategies are the same, and the only difference is the loss function. The classification outcomes on the test set are monitored, and the best results are picked until the 25th, 50th, 75th, and 100th epochs. The results are shown in Figure 10. As shown in Figure 10, the proposed loss leads the network to obtain a better performance compared to CE loss for both classifier and controller models. It can also be seen from Figure 10 that the classification accuracies for both experiments are increasing as the number of epochs increases, and both proposed classifier and controller models achieved their best performance at epoch 100. In addition, almost in all experiments, the error bar indicates that, when the model is equipped with a proposed loss, it has a smaller standard deviation than when the model is equipped with CE. In total, based on the presented results, it can be concluded that the proposed loss not only improves the performance of the models by increasing the accuracy but also speeds up the optimization procedure and subsequently decreases the training time. The results for 100 epochs are presented in more detail in Table 8.

Comparison of the Proposed CNN Model with State-of-the-Art Models
After key parameters evaluation, the performance of the proposed classifier and controller models are compared with three state-of-the-art models, including AlexNet, MobileNetV2, and VGG-16. As discussed in the methodology section, state-of-the-art models are designed to perform on the giant dataset, and, due to their convolution size and depth, overfitting occurs if they perform on the small dataset. Therefore, to train and make these models work properly on the prepared datasets, all of them are modified, and their hyperparameters are tuned according to the prepared datasets. The main modification is replacing the original fully-connected layers of these models with the fully-connected layers as designed for the proposed model. This significantly reduces the number of parameters. In addition, layer normalization is added to the main body of these models after the convolution layers. Finally, a strong dropout [33], with p = 0.5, is applied after the first fully connected layer.
The proposed classifier and controller models and adjusted AlexNet, MobileNet, and VGG-16 are trained on the prepared dataset with batch size 16 for 100 epochs. All the models are trained once with CE and once with proposed loss. For the proposed loss, the values of α and β are set to 2 and 5, respectively. SGD is employed as the optimizer with an initial rate of 3 × 10 −5 and a momentum of 0.9. Data are split into a training and test set with an 80:20 ratio. All the models are trained by providing them with the augmented training dataset. The size of input data for the state-of-the-art models have not changed, and the original size has been used, which is 256 × 256 for AlexNet, 224 × 224 for MobilenetV2 and VGG-16.
As the results of the test set of the prepared datasets are presented in Table 9, the proposed models, when equipped with proposed loss, outperformed the other models by obtaining the accuracy of 96.2% and 89.6% for classifier and controller models, respectively. Moreover, all the models show slightly better performance when they are equipped with the proposed loss comparison to when they employed CE as the loss function. More wellknown classification models were intended to be compared with the models as mentioned above, but they had pretty similar performance as presented models and therefore not been mentioned for comparison. These comparisons are not to demonstrate that the proposed classifier and controller models achieved better performance than the state-of-the-art models, but to illustrate that shallow networks such as the proposed classification model are simple yet accurate and effective for a small dataset. Table 9. Performance comparison between the proposed models with AlexNet, MobileNetV2, and VGG16. The results on the test sets are measured by accuracy and F1-Score (mean ± std). For each measurement, the best result is bold. PR-L and CE-L indicate the proposed loss and Cross-Entropy loss, respectively.

Backbone
Cost Function

Comparison of the Proposed CNN Model with State-of-the-Art Models
To evaluate the effectiveness of our approach, we examined the performance of the proposed system for mobile robot localization and navigation in the real environment. As mentioned, a real apartment (Figure 2) is chosen to take the experiments. Five different routes (see Table 2) are designed to evaluate the navigation system of the mobile robots.
For each route, if the robot executes all the commands correctly from the starting point and reaches the destination, it is considered a 100% success. Otherwise, if it makes a mistake during taking the route, the success rate will be calculated until the last command is done correctly. For example, if the robot is to move from workstation 1 to workstation 8 (route 1), first it should rotate to the angle of zero (T0), then move straight (Go), and finally stop at workstation 8 (SW8). In addition, before each "Go" command, the system checks the existence of the doorway and only allows the "Go" command to be executed if there is no obstacle in the way. Therefore, to complete path one, four commands must be executed correctly. If the proposed system correctly detects that there is an obstacle on the pathway (closed or semi-closed door), it stops the operation. In this case, only two commands are performed, rotation to the angle of zero (T0) and the obstacle check, which is considered as a 100% success. However, if the system mistakenly detects an obstacle in the path and stops the operation, then only rotation to the angle of zero (T0) command is executed correctly, which is a 25% success rate. We tested each of the five routes fifteen times. We left all the doors open five times for the robot to pass through without any obstacles. We also left at least one of the doors halfway open five times and closed at least one of the doors in the path five times. We tested each designed route fifteen times. Five times, we left all the doors open for the robot to move through the pathway without facing any obstacles. Five times, we also left at least one of the pathway's doors halfway open, and another five times left at least one of the pathway's doors fully closed.
We took the experiments of the system, with the proposed and VGG-16 backbone for classifier and controller models, along with proposed and CE loss. The experiment results of the proposed mobile robot localization and navigation system in the real apartment environment are presented in Table 10.
As the experimental results in Table 10 show, the designed system performs better when adopting the proposed loss than when employing the CE loss. In addition, the best performance of the proposed localization and navigation system is obtained when the proposed backbone is used for the classifier and controller models along with the proposed loss function.

Discussion
In this work, we proposed the CNN-based approach to address the image classification problem, which is a part of the proposed system for autonomous mobile robot localization and navigation in an indoor environment. We prepared a dataset with a small number of images. However, training deep learning models on a small data set is challenging, as they require a significant amount of data. Training the deep learning models on the small dataset may result in overfitting or performance limitations such as bounded accuracy. Therefore, to overcome these problems, we employed the following strategies: 1.
Instead of the complicated deep learning-based models, which usually are designed to perform on large datasets, we proposed a simple CNN-based model to deal with small prepared datasets. According to the results, the intended strategy for building the CNN model was an effective way for classification with a small dataset.

2.
We proposed a loss function that not only deals with the data that the model predicts their labels correctly, but also considered those data that are assigned to other classes instead of their actual class. In addition, the loss function has two effective parameters of ℵ and β which control the scale of the loss and the scale of class weight, respectively. The results have demonstrated that the proposed loss function improved the performance of the proposed classifier and controller models, speeds up the optimization procedure, and subsequently decreases the training time.

3.
Tuning of the hyper-parameters has led the networks to perform better by achieving higher accuracy rates. Data augmentation effectively improved the performance of the models by increasing the diversity of the training sets. Layer normalization made the models more reliable by reducing the noise of the training carves, and the SGD optimizer sped up the training process.

Conclusions
In this paper, we have introduced an approach based on the CNN model via image classification to address the localization and navigation of the mobile robot in an indoor environment. In combination with the proposed method and the topological map of the workspace, the mobile robot can find its location and navigate autonomously. Appropriate loss function has been proposed to improve the generalization capability since the prepared dataset has a small number of training images. The empirical studies demonstrated that the proposed techniques can effectively improve the performance of the neural networks, and thus benefit the navigation system. In future work, we would like to improve the proposed system so that the robot navigates autonomously in more complex environments. The limitation in presented work is when the environment has fewer features or has several areas with the same features, such as hospitals or hotels where there are many rooms with the same shape and characteristics. We strongly believe that more hardware and sensors such as Laser or RGB depth camera are required to overcome such a problem.

Conflicts of Interest:
The authors declare no conflict of interest.