1. Introduction
The rapid development of robotics and artificial intelligence applications is leading to the proliferation of mobile service robots [1,2]. Technological advancements, such as artificial intelligence and machine learning, have significantly improved the capabilities and autonomy of these robots, making them more efficient and reliable in performing various tasks. Additionally, the increasing demand for automation and efficiency in industries such as healthcare, hospitality, and logistics has created a strong market incentive for developing and deploying mobile service robots.
Also, the growing need for eldercare robots has become increasingly evident as the global population ages. These robots can provide valuable assistance and companionship to older adults, monitoring their health and enhancing their overall well-being [3]. However, these robots must be affordable to ensure widespread accessibility and adoption among families and caregivers [4].
A common requirement for these service robots is the ability to localize within their workspace, which is usually a man-made indoor environment [5]. Although precise position tracking can be provided by a SLAM (simultaneous localization and mapping) system using vision or RGB-D data, global localization remains a problem whenever the robot's previous position data cannot be used [6]. Such a problem arises in practice, for example, in dynamic environments due to occlusions. There are practical global localization algorithms, such as the one proposed in our previous work [7], but they have two functional limitations: they require long-range sensors to extract features that are distant from the robot, and they are computationally expensive. These limitations make them unsuitable for a small and inexpensive service robot.
Therefore, we propose a solution to the problem of global localization in a known (entirely or partially) environment using a passive catadioptric camera and the principle of recognizing places previously visited by the robot (Figure 1). The applied sensor with a catadioptric camera is a variant of the biologically inspired sensor with a hybrid field of view that we introduced in [8]. This sensor uses a catadioptric camera to achieve omnidirectional vision, an analogue of the peripheral vision found in vertebrates [9], which allows animals to quickly orient themselves to changes and hazards in the environment. The sensor described in [8,9] is complemented by a moving perspective camera that performs the functions of foveal vision, the more accurate but spatially limited vision mode in animals. This function is not used in the research presented in this article, as we limit the scope to global appearance-based localization, i.e., the assignment of the robot's current location to one of the previously recognized (visited) places. Our approach yields information about the similarity of the places observed in the current perception and locations stored in a reference map.
Although appearance-based localization does not provide an accurate metric position of the robot in a global reference system, the ability to tell if the robot is close to one of the known locations is often sufficient for indoor navigation [10]. If a map of reference places is collected at high density (e.g., based on a grid with cells of one meter in size or smaller), this kind of localization may be sufficient for the service robot's tasks. In addition, appearance-based localization can be supplemented by visual odometry or the recognition of artificial landmarks deployed at a given location [11]. The perspective camera of a hybrid sensor can be used to perform these functions. The main objectives of this research work are the following:
Experimental analysis of neural network architectures in search of one suitable for an image-based place recognition system implemented on the embedded computer of an intelligent vision sensor with limited power and resources.
Experimental verification of the possibility of using catadioptric camera images in the appearance-based localization task without converting them into panoramic form, which significantly reduces the computational load.
Analysis of the strategy for creating training sets in a place recognition task, assuming that the obtained solution should generalize to different image acquisition conditions, mainly varying illumination.
To meet these objectives, we propose a novel approach that adopts a convolutional neural network (CNN) architecture to directly process the omnidirectional images for real-time place recognition. CNNs are specialized for processing grid-like data, particularly images, using convolutional layers and parameter sharing to capture spatial patterns effectively. The proposed system leverages the concept of global image descriptors, which have already proven efficient in place recognition [12]. We employ a CNN to produce the descriptors in the form of embedding vectors directly from the omnidirectional images, thus avoiding the processing overhead required for computing undistorted panoramic images, which are often used in appearance-based localization with catadioptric cameras [13]. The proposed architecture is optimized for inference on the Nvidia Jetson TX2 edge computing platform integrated with our sensor. The low-cost Jetson TX2 board is designed for peak processing efficiency at only 7.5 W of power. Regarding energy consumption for image processing, the Jetson TX2 has a clear advantage over an x86-based platform [14]. While the exact power consumption depends on the specific image processing workload, the Jetson TX2 is designed to provide a good balance between performance and energy efficiency [15]. Hence, by applying an integrated sensor with an edge computing platform and developing a matching small-footprint neural network architecture, we obtain a self-contained, energy-efficient, and compact system for real-time appearance-based localization that can be integrated with practically any mobile service robot, providing this robot with reliable global localization capabilities at low cost. The contribution of this paper is threefold:
A novel, simple-yet-efficient CNN-based architecture of the appearance-based localization system that leverages a lightweight CNN backbone trained via transfer learning to produce the embeddings, and the K-nearest neighbours method for quickly finding an embedding matching the current perception.
A thorough experimental investigation of this architecture, considering several backbone network candidates and omnidirectional or panoramic images used to produce the embeddings. The experiments were conducted on three different datasets: two collected with variants of our bioinspired sensor and one publicly available.
An investigation of the strategies for creating the training set and the reference map for the localization system, conducted on the COLD Freiburg dataset. This part of our research allowed us to test how our neural network model generalizes to images acquired under different lighting/weather conditions. It resulted in the recommendation to use data balanced with respect to their acquisition parameters, which improves generalization.
The remainder of this article is structured as follows. The most important related works are reviewed in Section 2. Section 3 introduces the proposed architecture of the localization system and details the neural networks being used. Next, Section 4 describes the experimental setups and datasets used to test various aspects of the proposed solution, while Section 5 provides the results of experiments and contributes an in-depth analysis of the performance of different variants of the investigated system. Finally, Section 6 concludes the article and proposes future extensions.
2. Related Work
Appearance-based localization from omnidirectional images has garnered significant attention in computer vision and robotics. Researchers have developed various techniques to address the challenges posed by the distortion and wide field of view of omnidirectional cameras. This section reviews the most relevant works that have contributed to the state of the art in this area.
The application of passive vision sensors for localization has been extensively researched in robotics, resulting in several visual Simultaneous Localization and Mapping (SLAM) algorithms [16]. However, the applications of visual SLAM on commercially viable mobile robots are limited by the often-insufficient on-board computing resources of such robotic platforms, and by problems caused by changing lighting conditions, rapid changes of viewpoint while the robot is moving, and the lack of salient local features in some indoor environments. Moreover, SLAM is not guaranteed to solve the global localization problem whenever the robot loses track of its pose due to any of the issues mentioned above [17].
Therefore, the appearance-based recognition of locations becomes an attractive addition to visual SLAM for closing loops and relocalizing a lost robot [18]. This approach, in many variants, is also considered a localization method on its own, which is particularly suitable for large-scale outdoor scenarios [10]. Unlike the visual SLAM algorithms, appearance-based localization methods only determine if the observed scene resembles an already visited location. However, the place recognition methods scale better for large environments than typical SLAM algorithms [19]. In this context, catadioptric cameras yielding omnidirectional images improve the reliability of place recognition for robot localization in comparison to narrow-field-of-view perspective cameras, as demonstrated by the work on the COsy Localization Database (COLD) dataset [20], which we also use to evaluate our localization system. An interesting research direction is to use image sequences instead of individual images, which decreases the number of false positives in place recognition for environments with self-similarities and increases robustness to scene dynamics [12]. We applied this idea in our earlier work on place recognition for mobile devices [21], making it possible to implement robust place recognition on a smartphone with very limited computing power, while still using nondistorted perspective images.
In the appearance-based methods, each image is described by descriptors of salient features contained in this image, or is directly described by a whole-image descriptor. Although SURF features were used directly in appearance-based localization performing image retrieval in a hierarchical approach [22], the direct matching of local features is considered inefficient for place recognition [10] if point feature descriptors are used (such as the popular SIFT, SURF, and ORB [23]). Hence, the bag of visual words (BoVW) technique [24] is commonly used, which organizes the features into a visual vocabulary. Next, images described by visual words can be efficiently matched by comparing binary strings or histograms. One prominent example of a location recognition algorithm employing the BoVW technique is FAB-MAP [25,26], which efficiently compares images with a histogram-based approach.
Global image descriptors have proven effective for capturing the overall appearance of omnidirectional images [27]. Earlier works focused on adapting existing, general-purpose feature extraction and matching algorithms. Menegatti et al. [28] proposed using the Fourier transform to handle geometric distortions in catadioptric images. More recently, Payá et al. [29] introduced a method based on the Radon transform to extract global environmental descriptions from omnidirectional images. These works provided foundations for subsequent research by addressing the specific characteristics of omnidirectional images. Examples of hand-crafted descriptors adopted for the global description of omnidirectional images include HOG (histogram of oriented gradients) [30] and Gist [31], which were applied to omnidirectional images from a catadioptric camera in appearance-based localization by Cebollada et al. [32]. While both these methods of image description provided relatively efficient descriptions of the images, allowing the localization system to recognize the places accurately, the descriptor construction algorithms initially developed for perspective camera images required the catadioptric images to be undistorted and converted to panoramic images, which creates a significant computation overhead.
Machine learning methods have gained popularity in place recognition, also from omnidirectional images [33]. Working with typical perspective images, Li et al. [34] proposed an image similarity measurement method based on deep learning, which combines local and global features to describe the image and can be used for indoor place recognition by a robotic agent. Significant progress in appearance-based localization and navigation was achieved by the NetVLAD approach [35], a CNN-based method that aggregates local features into a global image representation. The NetVLAD network consists of a CNN for feature extraction and a layer based on the vector of locally aggregated descriptors (VLAD) [36]. In this architecture, VLAD is a feature quantization technique similar in concept to the bag of visual words idea, as it captures information about the statistics of an image's local descriptors. VLAD is a method for combining descriptors used both for instance-level search [37] and for image classification [38]. Although Cheng et al. [39] used NetVLAD with panoramic images from an omnidirectional system, this approach was demonstrated successfully mainly in outdoor scenarios with perspective camera images. For indoor scenarios, [13] introduced the omnidirectional convolutional neural network (O-CNN) architecture, which, similarly to our approach, is trained to retrieve the closest place example from the map. Whereas the O-CNN architecture takes advantage of the omnidirectional view by incorporating circular padding and rotation invariance, it requires the omnidirectional images to be converted to their panoramic counterparts. Also, Cebollada et al. [40] demonstrated the benefits of solving the localization problem as a batch image retrieval problem by comparing descriptors obtained from intermediate layers of a CNN. A CNN processing rectangular panoramic images reconstructed from the original catadioptric input is used in that work.
As the construction of invariant feature descriptors for omnidirectional images is problematic, Masci et al. [41] proposed to learn invariant descriptors with a similarity-preserving hashing framework and a neural network to solve the underlying optimization problem. Ballesta et al. [42] implemented hierarchical localization with omnidirectional images using a CNN trained to solve a classification task for distinguishing between different rooms in the environment, and then a CNN trained for regression of the pose within the recognized room. Although this solution does not require converting the catadioptric images into panoramic ones, its performance is limited by the employed two-stage scheme with separated classification and regression steps. More recent work from the same team [43] solved the appearance-based localization problem by applying a hierarchical approach with the AlexNet CNN. Assuming an indoor environment, they first accomplished a room retrieval task and then carried out the fine localization step within the retrieved room. To this end, the CNN was trained to produce a descriptor, which was compared with the visual model of the selected room using a nearest neighbour search. This approach does not require panoramic conversion of the collected catadioptric images and is overall most similar to the solution proposed in this paper. However, we introduce a much simpler, single-stage architecture based on a recent, lightweight CNN backbone, and the concept of direct retrieval of the image stored in the environment map that is most similar in appearance to the query image. The efficient process of constructing the embeddings with a pretrained CNN, followed by a fast comparison of these embeddings/descriptors in the KNN framework, allowed us to forgo the separate room retrieval step in favour of a single-stage architecture, which suits our embedded computing platform well. We compare it directly to the results shown in [43] on the COLD Freiburg dataset, demonstrating our approach's superior performance and real-time capabilities.
3. Localization System Architecture
In the proposed localization system, the robot determines its current location by assessing the similarity between the currently captured image (query image) and images stored in a database (map) describing the environment. This task amounts to efficient, real-time image retrieval [10]. The localization procedure involves comparing a global descriptor constructed in real time from the image currently captured by the robot with a previously prepared database of descriptors representing the images of previously visited places, and finding the image with the highest similarity in the feature space (Figure 2). Each location has its representation in the prepared database of images, and the locations where the images were taken are assumed to cover the entire workspace of the robot. Images from the database are recorded at known locations, so finding the one with the minimum distance (in the sense of similarity of appearance) to the current perception allows our robotic agent to approximate its location in the real world.
The proposed localization system uses a CNN to determine the set of natural features for a given location, and the K-nearest neighbours (KNN) [44,45] algorithm to find the closest image from the provided database of images.
CNN and KNN are both machine learning techniques, but they differ in their approach and application. A CNN learns hierarchical representations of data through multiple convolutional layers, pooling, and fully connected layers. In contrast, KNN is a simple and intuitive algorithm for classification and regression tasks. It makes predictions based on the similarity of new data points to the existing labelled data points in the feature space. One can use a CNN to extract features from images and then apply KNN to those extracted features for classification. This hybrid approach leverages the strengths of both algorithms, with the CNN capturing intricate patterns and KNN using the extracted features for classification [46]. This idea is used in our localization system. The backbone CNN creates descriptors, which hereinafter are also called “embeddings”, directly from the omnidirectional images, avoiding the additional computations required to obtain undistorted panoramic images. The KNN algorithm uses the embeddings, which encode the most salient features of the observed places, to find in the database (i.e., the global map) the images that best match the current observation. In Section 4, we demonstrate that the accuracy of localization with raw catadioptric images is at least as good as with the converted panoramic images, while it demands less computing power.
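As an illustration of this step, the following minimal sketch extracts an embedding from a single image using a pretrained backbone with its classification head removed. It assumes PyTorch with a torchvision EfficientNet; the model variant, input resolution, and normalization constants are illustrative assumptions, not the exact configuration of our system.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a pretrained backbone and drop the classification head so the
# network outputs an embedding vector instead of class scores.
backbone = models.efficientnet_v2_s(weights="IMAGENET1K_V1")
backbone.classifier = torch.nn.Identity()  # keep only the feature extractor
backbone.eval()

preprocess = T.Compose([
    T.Resize((384, 384)),  # illustrative input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(image_path: str) -> torch.Tensor:
    """Produce a global descriptor (embedding) for one catadioptric image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(img).squeeze(0)  # a 1280-D vector for this backbone
```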
The preparation of the CNN model is based on training the network to correctly recognize places, with the specific aim of training the higher layers of the network to properly extract feature maps specific to each location. Because the CNN used as the backbone of our system is pretrained on images unrelated to the target domain (the ImageNet dataset [47] was used in pretraining), the network was fine-tuned before use by unfreezing several layers and training on the target domain images using cross-entropy as a loss function. Cross-entropy defines the distance between two probability distributions according to the equation:

$$H(p, q) = -\sum_{x} p(x) \log q(x),$$

where:
- $x$ — the actual location;
- $\hat{x}$ — the location obtained via the neural network (the maximizer of $q$);
- $p$ — the probability distribution of the actual location;
- $q$ — the probability distribution of the location determined via the neural network (prediction).
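A minimal sketch of such a fine-tuning setup is shown below, assuming PyTorch. The number of places, the choice of which layers to unfreeze, and the optimizer settings are hypothetical; they only illustrate the scheme of freezing the pretrained backbone, unfreezing the top layers, and training a new classification head with the cross-entropy loss.

```python
import torch
import torch.nn as nn
import torchvision.models as models

num_places = 64  # hypothetical number of reference places

# Freeze the pretrained backbone, then unfreeze only the top feature blocks.
model = models.efficientnet_v2_s(weights="IMAGENET1K_V1")
for param in model.parameters():
    param.requires_grad = False
for param in model.features[-2:].parameters():
    param.requires_grad = True

# Replace the ImageNet head with a place classifier (trainable by default).
model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_places)

criterion = nn.CrossEntropyLoss()  # implements H(p, q) for one-hot targets
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

def train_step(images: torch.Tensor, place_labels: torch.Tensor) -> float:
    """One optimization step over a batch of images and place labels."""
    optimizer.zero_grad()
    loss = criterion(model(images), place_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```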
At first, the images are processed by the trained convolutional neural network, from which the output layer has been removed, to obtain descriptors (in the form of embedding vectors) that describe the global characteristic features of each image in the database, i.e., each unique place visited by the robot. In this way, a global map of all locations based on reference images is created. Not all convolutional network architectures from the literature can be used on a robotic onboard computer with limited computational and memory resources.
This research uses backbone networks from the MobileNet [48] and EfficientNet [49] families, which are optimized for mobile devices while ensuring high accuracy with a minimal number of parameters and mathematical operations. The MobileNet model uses depthwise separable convolution layers consisting of a depthwise convolution and a pointwise convolution. The depthwise (spatial) convolution applies a single filter to each input channel. MobileNet V2 introduced a new module with an inverted residual structure, with two types of blocks: an inverted residual block with a stride of 1, and a block with a stride of 2 that reduces the size of the feature map. Both types of blocks have three layers. The first layer is a 1 × 1 convolution with the ReLU activation function, the second layer is a depthwise convolution, and the third and final layer is another 1 × 1 convolution with a linear bottleneck. Residual blocks connect the beginning and end of the convolutional block via a skip connection. Adding these two states allows the network to access previous activations that were not modified in the convolutional block; this approach has proven essential for building networks of large depth. In MobileNet V2, the basic convolutional layer is called MBConv and contains an inverted residual block with a linear bottleneck and depthwise separable convolution, with batch normalization after each convolutional layer.
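For illustration, a simplified PyTorch implementation of such an inverted residual block with a linear bottleneck is given below; it is a sketch of the general structure, not the exact MobileNet V2 code.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of a MobileNet V2 inverted residual block: 1x1 expansion,
    3x3 depthwise convolution, and a linear 1x1 projection (bottleneck),
    with a skip connection when input and output shapes match."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand: int = 6):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),          # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride,
                      padding=1, groups=hidden, bias=False),   # depthwise 3x3
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),          # linear bottleneck
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        # The skip connection gives access to activations from before the block.
        return x + out if self.use_skip else out
```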
The EfficientNet model, which we have selected for our final architecture, can be seen as a further step towards efficiency compared to the MobileNet model. EfficientNet uses a compound model scaling technique based on a set of specified coefficients. Instead of arbitrarily scaling width, depth, or resolution, compound scaling uniformly scales each dimension using a fixed set of scaling coefficients. Such scaling increases the predictive ability of the network by replicating the underlying convolutional operations and structure of the network. EfficientNet uses the same MBConv blocks as the MobileNet V2 network, but with an added squeeze-and-excitation (SE) block [50]. This structure helps reduce the overall number of operations required and the model's size.
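The following sketch shows a minimal squeeze-and-excitation block in PyTorch; the reduction ratio and activation functions are typical choices and may differ from those in actual EfficientNet implementations.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Minimal squeeze-and-excitation (SE) block of the kind used inside
    EfficientNet's MBConv blocks; a sketch for illustration only."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        squeezed = max(1, channels // reduction)
        self.fc1 = nn.Conv2d(channels, squeezed, kernel_size=1)  # squeeze
        self.fc2 = nn.Conv2d(squeezed, channels, kernel_size=1)  # excite

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global average pooling summarizes each channel with one value...
        scale = x.mean(dim=(2, 3), keepdim=True)
        scale = torch.sigmoid(self.fc2(torch.relu(self.fc1(scale))))
        # ...which becomes a per-channel gate that reweights the input.
        return x * scale
```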
The backbone CNN extracts features from the image that uniquely describe different locations and builds embedding vectors that serve as global image descriptors in our system. In the next step, the algorithm creates an index from the global map, which is used for efficient similarity search. The original images collected by the robot are no longer needed for localization, and the obtained global map has a compact form. All operations to produce the global map are performed offline.
Then, to localize the robot, we need to query the global map (database of embeddings) with the descriptor/embedding produced from the current perception of the agent, which boils down to a similarity search task. Similarity search is a typical issue in machine learning solutions using embedding vectors, and becomes increasingly difficult as the dimension of the vectors and/or the size of the database increases. Classic methods for finding similarity between vector-described elements in an extensive database include linear search and search in K-D-trees [51]. K-D-trees are binary trees used to organize points representing data in a K-dimensional space and allow for a very efficient search of points in that space, including the nearest neighbour (NN) search we are interested in [52].
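As a toy illustration of a K-D tree nearest neighbour query, here with the scikit-learn implementation and random low-dimensional points (real embedding vectors are far higher-dimensional, which is where exact K-D trees start to lose their efficiency advantage):

```python
import numpy as np
from sklearn.neighbors import KDTree

# Build a K-D tree over 1000 random 3-D points (placeholder data).
points = np.random.rand(1000, 3)
tree = KDTree(points)

# Query the nearest neighbour of a random point.
query = np.random.rand(1, 3)
dist, ind = tree.query(query, k=1)  # distance to and index of the nearest point
print(ind[0][0], dist[0][0])
```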
Each node in the tree represents a K-dimensional point. Each nonleaf node acts as a hyperplane dividing the space into two parts. Using a K-D tree for nearest neighbour search involves finding the point in the tree that is closest to a given query point. For this purpose, the algorithm traverses the tree and compares the distance between the query point and the points in each leaf node. Starting from the root node, it recursively moves down the tree until it reaches a leaf node, following the same procedure as when inserting a node. Many implementations of the nearest neighbour search using K-D-trees are available for Python, including the very popular SciKit-Learn library. However, for this project, we selected the Facebook AI Similarity Search (Faiss) library [53], written in C++ with wrappers for Python and support for GPUs, which suits our implementation on the Nvidia Jetson well. The Faiss library solves our similarity search problem using indexing and searching with the KNN method. Once the index type is selected, the algorithm processes the embedding vectors obtained from the neural network and places them in the index. The index can be stored on disk or in memory, and searching, adding, or removing items can be performed in real time. In addition, the Faiss library has an autotuning mechanism that scans the parameter space and selects the parameters that provide the best possible search time at a given accuracy.
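A minimal sketch of this indexing and search workflow with Faiss is shown below; the embedding dimension, map size, and random data are placeholders for illustration.

```python
import numpy as np
import faiss  # Facebook AI Similarity Search

# Hypothetical map: 5000 reference embeddings of dimension 1280
# (matching the backbone output sketched above); values are placeholders.
d = 1280
map_embeddings = np.random.rand(5000, d).astype("float32")

# Build a flat (exhaustive) L2 index; Faiss also offers approximate and
# GPU-backed indexes when the map grows large.
index = faiss.IndexFlatL2(d)
index.add(map_embeddings)

# Query with the embedding of the current perception; the k nearest map
# images are returned together with their squared L2 distances.
query = np.random.rand(1, d).astype("float32")
distances, indices = index.search(query, 5)
print(indices[0])  # indices of the 5 most similar map images
```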
Place recognition begins by loading the learned CNN model and the index of images (map) into memory, and then the captured images (queries) are compared with the previously created image database using the KNN algorithm in the space of embedding vectors. The embeddings are compared using the L2 (Euclidean) distance, which has been shown to be more computationally efficient than feature binarization followed by comparison with the Hamming distance [36,54].
Once the similarity between the query image and the map is determined, the results are presented in the form of the image retrieval accuracy and the position error between the query image and the map image determined as the most similar one. As we assume that ground truth positions for all map images and query images are known, as in the COLD dataset [55], we simply use the Euclidean distance in metric space to quantify this error. The averaged Euclidean distance is used to calculate the position error over an entire experiment involving many queries. The arithmetic mean is calculated over all places according to the equation:

$$E = \frac{1}{N} \sum_{i=1}^{N} \sqrt{(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2},$$

where:
- $E$ — the average position measurement error;
- $N$ — the number of query images;
- $x_i$ — the x coordinate of the ground truth location of the i-th query image;
- $y_i$ — the y coordinate of the ground truth location of the i-th query image;
- $\hat{x}_i$ — the x coordinate of the estimated location of the i-th query image;
- $\hat{y}_i$ — the y coordinate of the estimated location of the i-th query image.
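For completeness, this error metric is straightforward to compute, e.g., with NumPy (the coordinate values below are purely illustrative):

```python
import numpy as np

def mean_position_error(gt_xy: np.ndarray, est_xy: np.ndarray) -> float:
    """Average Euclidean distance between ground truth and estimated
    positions, as in the equation above; both arrays have shape (N, 2),
    one (x, y) pair per query image."""
    return float(np.mean(np.linalg.norm(gt_xy - est_xy, axis=1)))

# Toy example with three queries:
gt = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
est = np.array([[0.1, 0.0], [1.0, 2.5], [2.0, 1.0]])
print(mean_position_error(gt, est))  # ~0.533 m
```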
The architecture of the localization system, shown in its general form in Figure 1, was tested in several variants differing in the type of neural network used as the extractor of image embeddings and in whether the catadioptric camera images were used directly or converted to panoramic form. The suitability of the NetVLAD approach in the described system was also investigated. The investigated variants are described in the next section of this paper.
6. Conclusions
The results of testing the place recognition software with catadioptric cameras on the edge computing platform allow us to conclude that the proposed neural network architecture and parallel processing make it possible to obtain a real-time localization system that works with raw catadioptric images, despite their distorted nature.
The extensive study of the algorithm of appearance-based localization and comparison of results with similar solutions known from the literature demonstrate that the proposed approach makes it possible to obtain highly descriptive embeddings of the observed locations, and consequently, efficient appearance-based localization.
The most important conclusions, summarizing the remarks discussed in Section 5, concern the best performance of the EfficientNet V2L CNN backbone for generating the embeddings and the pivotal importance of preparing a well-balanced training set for this network, even if transfer learning with pretraining on a large dataset of general-purpose images is used. A practical conclusion is that the not-so-recent and low-cost Nvidia Jetson TX2 embedded computer is enough to run a carefully engineered deep learning system for appearance-based localization. This opens interesting opportunities for developing affordable service and social indoor mobile robots utilizing a catadioptric camera as the main localization sensor.
However, a limitation of the proposed appearance-only approach to global localization is the limited accuracy of the obtained metric position of the robot. This accuracy depends on the density of the global map, because the obtained position of the robot is defined by the known location of the most similar image. If the images were collected close to each other, the position of the robot can be determined more accurately, but if the distances between the points where the images were captured are large, the accuracy decreases. This limitation will be addressed in our further research by implementing a neural network that regresses the position of the robot with respect to the reference image retrieved from the map. Further research on this system will also concern the implementation of the triplet loss with hard negative mining, as this training scheme has turned out to be very effective in a number of localization systems. This training strategy should allow the network to develop more specific features, thus making the localization system more effective in highly repetitive indoor environments.