Place recognition: An Overview of Vision Perspective

Place recognition is one of the most fundamental topics in the computer vision and robotics communities, where the task is to accurately and efficiently recognize the location of a given query image. Despite years of accumulated knowledge in this field, place recognition remains an open problem because of the many ways in which the appearance of real-world places may differ. This paper presents an overview of the place recognition literature. Since condition-invariant and viewpoint-invariant features are essential to a robust long-term visual place recognition system, we start with traditional image description methods, which exploit techniques from the image retrieval field. Recently, rapid advances in related fields such as object detection and image classification have inspired a new technique for improving visual place recognition systems, namely convolutional neural networks (CNNs). We therefore then introduce recent progress in visual place recognition systems that use CNNs to automatically learn better image representations for places. Finally, we close with a discussion of future work in place recognition.


INTRODUCTION
Place recognition has attracted a significant amount of attention in the computer vision and robotics communities, as evidenced by related citations and a number of workshops dedicated to improving long-term robot navigation and autonomy. It has applications ranging from autonomous driving and robot navigation to augmented reality and geo-localizing archival imagery.
Place recognition, the process of identifying the location of a given image by querying the locations of images of the same place in a large geotagged database, is still an open problem. One major characteristic that separates place recognition from other visual recognition tasks is that it has to solve condition-invariant recognition to a degree that many other fields have not. How can we robustly identify the same real-world place when it undergoes major changes in appearance, e.g., illumination variation (Figure 1), change of seasons (Figure 1) or weather, structural modifications over time, and viewpoint change? To be clear, we summarize the above changes in appearance, excluding viewpoint change, as conditional variations. Moreover, how can we distinguish true matches from similar-looking images without supervision, given that collecting geotagged datasets is time-consuming and labor-intensive and that settings such as indoor places do not necessarily have GPS information? The place recognition task has traditionally been cast as an image retrieval task Yu et al. (2015b), in which image representations for places are essential. The fundamental scientific question is: what is the appropriate representation of a place that is informative enough to recognize real-world places, yet compact enough to satisfy the real-time processing requirements of a terminal such as a mobile phone or a robot?
In the early stages, place recognition was dominated by sophisticated local invariant feature extractors such as SIFT Lowe (2004) and SURF Bay et al. (2006), hand-crafted global image descriptors such as GIST Oliva and Torralba (2001;2006), and the bag-of-visual-words approach Philbin et al. (2007); Sivic et al. (2003). These traditional feature extraction techniques were a major step towards robust place recognition.
Recent years have witnessed rapid progress in visual content recognition using a powerful extractor of image representations, the convolutional neural network (CNN) Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Yu et al. (2017b;a), which sets state-of-the-art performance on many category-level recognition tasks such as object classification Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Christian et al. (2014), scene recognition Zhou et al. (2014), and image classification Yu et al. (2012). The principal ideas behind CNNs date back to the 1980s, and the two major reasons why CNNs are so successful in computer vision are advances in GPU-based computation power and in data volume. Recent studies show that general features extracted by CNNs are transferable Sharif Razavian et al. (2014) and generalize well to other visual recognition tasks. The semantic gap is a well-known problem in place recognition: places with different semantics may share common low-level features extracted by SIFT, e.g., colors and textures. Convolutional neural networks may bridge this semantic gap by representing an image as a high-level feature vector extracted through deep stacked layers.
This paper provides an overview of both traditional and deep-learning-based descriptive techniques widely applied to the place recognition task; it is by no means exhaustive. The remainder of this paper is organized as follows. Section 1 introduces the place recognition literature. Section 2 discusses local and global image descriptors widely applied to place recognition. Section 3 gives a brief overview of convolutional neural networks and the corresponding techniques used in place recognition. Section 4 discusses future work in place recognition.

TRADITIONAL IMAGE DESCRIPTORS

LOCAL IMAGE DESCRIPTORS
Local feature descriptors such as SIFT Lowe (2004) and SURF Bay et al. (2006) have been widely applied to visual localization and place recognition tasks. They achieve outstanding viewpoint invariance. SIFT and SURF describe the local appearance of individual patches or key points within an image. For an image, $X = [x_1, x_2, \ldots, x_N]^T$ represents the set of local invariant features, in which $x_i \in \mathbb{R}^d$ is the $i$-th local feature. The length $N$ of $X$ depends on the number of key points in the image. These local invariant features are then usually aggregated into a single compact vector for the entire image, using techniques such as bag-of-visual-words Philbin et al. (2007); Sivic et al. (2003), VLAD Jégou et al. (2010), and the Fisher kernel Jaakkola et al. (1999); Perronnin and Dance (2007).
Since local feature extraction consists of two phases, detection and description, a number of variations and extensions have been developed for each phase. For example, Mei et al. (2009) used FAST Rosten and Drummond (2006) to detect patches of interest in an image, which were then described by SIFT. Churchill and Newman (2013) also used FAST during the detection phase, but used BRIEF Calonder et al. (2012) instead of SIFT to describe the key points.
State-of-the-art visual simultaneous localization and mapping (SLAM) systems such as FAB-MAP Cummins and Newman (2008;2011) used bag-of-visual-words to construct the final image representation. Each image may contain hundreds of local features; matching them directly is impractical for large-scale, real-time place recognition, and storing the high-dimensional features requires an enormous amount of memory. The bag-of-visual-words model mimics the bag-of-words technique used in efficient text retrieval. It typically forms a codebook $C = [c_1, c_2, \ldots, c_k]$ of $k$ visual words, where each visual word $c_i \in \mathbb{R}^d$ is the centroid of a cluster, usually obtained by k-means Kanungo et al. (2002). Each local invariant feature $x_i$ is then assigned to its nearest cluster centroid, i.e., its visual word. This yields a $k$-dimensional histogram vector containing the frequency with which each visual word is assigned. The bag-of-visual-words model and local image descriptors generally ignore the geometric structure of the image: the order of the local invariant features in $X$ has no effect on the histogram vector, so the resulting image representation is viewpoint invariant. There are several ways to normalize the histogram vector; a common choice is $L_2$ normalization. The components of the vector are then weighted by inverse document frequency (idf). However, the codebook is dataset dependent and needs to be retrained if a robot moves into a region it has never seen before. The lack of structural information about the image can also weaken the performance of the model.

The Fisher kernel, proposed by Jaakkola et al. (1999); Perronnin and Dance (2007), is a powerful tool in pattern classification that combines the strengths of generative models and discriminative classifiers. It defines a generative probability model $p(X \mid \lambda)$, a probability density function with parameters $\lambda$. One can then characterize the set of local invariant features $X$ with the following gradient vector:

$$\nabla_\lambda \log p(X \mid \lambda)$$

where, intuitively, the gradient of the log-likelihood describes the direction in which the parameters should be modified to best fit the observed data. It transforms a variable-length sample $X$ into a fixed-length vector whose size depends only on the number of parameters in the model. Perronnin and Dance (2007) applied the Fisher kernel to image classification, using a Gaussian mixture model to model the visual words. They obtain a $((2d + 1)k - 1)$-dimensional representation of a local invariant feature set, whereas bag-of-visual-words yields a $k$-dimensional representation. Thus, for an equal codebook size, the Fisher kernel provides richer information; equivalently, this more sophisticated representation requires fewer visual words.

Jégou et al. (2010) proposed a local aggregation method called the Vector of Locally Aggregated Descriptors (VLAD), which became a state-of-the-art technique relative to bag-of-visual-words and the Fisher vector. The final representation is computed as follows:

$$v_{i,j} = \sum_{x \,:\, \mathrm{NN}(x) = c_i} \left( x_j - c_{i,j} \right)$$

Here $\mathrm{NN}(x)$ denotes the visual word nearest to the local feature $x$, and the VLAD vector is represented by $v_{i,j}$, where the indices $i = 1, \ldots, k$ and $j = 1, \ldots, d$ index the $i$-th visual word and the $j$-th component of the local invariant feature, respectively. The vector is subsequently $L_2$ normalized. The final representation thus stores, for each visual word, the sum of the residuals between that word and the local invariant features assigned to it. One excellent property of the VLAD vector is that it is relatively sparse and highly structured; Jégou et al. (2010) show that principal component analysis is likely to capture this structure, allowing dimensionality reduction without much degradation of the representation. They obtain search quality comparable to bag-of-visual-words and the Fisher kernel with at least an order of magnitude less memory.
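As an illustration, the bag-of-visual-words histogram and the VLAD vector described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the implementation of any cited system: it assumes that a codebook has already been learned (e.g., by k-means), the function names and the toy data are hypothetical, and idf weighting is omitted.

```python
import numpy as np

def assign_to_words(X, C):
    """Index of the nearest visual word in C (k x d) for each feature in X (N x d)."""
    # Squared Euclidean distance between every feature and every centroid.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def l2_normalize(v):
    """Scale v to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def bovw_histogram(X, C):
    """k-dimensional, L2-normalized histogram of visual-word frequencies."""
    counts = np.bincount(assign_to_words(X, C), minlength=C.shape[0])
    return l2_normalize(counts.astype(float))

def vlad(X, C):
    """(k*d)-dimensional VLAD vector: per-word sums of residuals, L2-normalized."""
    k, d = C.shape
    nearest = assign_to_words(X, C)
    V = np.zeros((k, d))
    for i in range(k):
        members = X[nearest == i]
        if len(members):
            V[i] = (members - C[i]).sum(axis=0)
    return l2_normalize(V.ravel())

# Toy example (hypothetical data): k = 2 visual words, d = 2, N = 3 features.
C = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]])
hist = bovw_histogram(X, C)   # proportional to the counts [2, 1]
v = vlad(X, C)                # proportional to the residual sums [1, 1, -1, 0]
```

Reordering the rows of `X` leaves both vectors unchanged, which is the order-independence property noted in the text. The histogram has k entries, whereas the VLAD vector has k·d entries.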

GLOBAL IMAGE DESCRIPTORS
The key difference between local place descriptors and global place descriptors is the presence of a detection phase. A local place descriptor turns into a global descriptor when the key point is predefined as the whole image. WI-SURF Badino et al. (2012) used whole-image descriptors based on SURF features, and BRIEF-GIST Sünderhauf and Protzel (2011) used BRIEF Calonder et al. (2012) features in a similar whole-image fashion.
A representative global descriptor is GIST Oliva and Torralba (2001;2006). It has been shown to model semantically meaningful and visually similar scenes suitably in a very compact vector. The gist of a scene refers to the amount of perceptual and semantic information that observers comprehend within a glance (around 200 ms); termed the Spatial Envelope properties, it encodes the dominant spatial layout of the image. GIST uses Gabor filters at different orientations and frequencies to extract information from the image. The results are averaged to generate a compact vector that represents the gist of the scene. Murillo and Kosecka (2009) applied GIST to large-scale omnidirectional imagery and obtained a good segmentation of the search space into clusters, e.g., tall buildings, streets, open areas, and mountains. Siagian and Itti (2009) followed a biologically inspired strategy that first computes the gist of a scene to produce a coarse localization hypothesis, and then refines it by locating salient landmark points in the scene.
The performance of the codebook-based techniques described in Section 2 depends mainly on the size of the codebook C: if it is too small, the codebook will not characterize the dataset well; conversely, if it is too large, it will require huge computational resources and time. Global image descriptors have their own disadvantage: they usually assume that images are taken from the same viewpoint.
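The filter-then-average idea behind GIST can be sketched as follows. This is a simplified, GIST-like descriptor and not the reference implementation of Oliva and Torralba: the Gabor-like filters are built directly in the frequency domain, and the centre frequencies, bandwidths, the 4 × 4 averaging grid, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def gabor_bank(size, n_scales=2, n_orientations=4):
    """Gabor-like transfer functions on a size x size frequency grid."""
    fy = np.fft.fftfreq(size)[:, None]   # vertical frequency, cycles/pixel
    fx = np.fft.fftfreq(size)[None, :]   # horizontal frequency, cycles/pixel
    radius = np.sqrt(fx ** 2 + fy ** 2)
    angle = np.arctan2(fy, fx)
    bank = []
    for s in range(n_scales):
        f0 = 0.25 / (2 ** s)             # assumed centre frequency per scale
        for o in range(n_orientations):
            theta = np.pi * o / n_orientations
            # Angular distance to the filter orientation, wrapped to [-pi, pi].
            dtheta = np.angle(np.exp(1j * (angle - theta)))
            radial = np.exp(-((radius - f0) ** 2) / (2 * (0.3 * f0) ** 2))
            angular = np.exp(-(dtheta ** 2) / (2 * (np.pi / n_orientations) ** 2))
            bank.append(radial * angular)
    return bank

def gist_like(image, grid=4, n_scales=2, n_orientations=4):
    """Average filter-response magnitudes over a grid x grid layout."""
    size = image.shape[0]
    if image.shape != (size, size) or size % grid != 0:
        raise ValueError("expected a square image whose side is divisible by grid")
    spectrum = np.fft.fft2(image - image.mean())
    cell = size // grid
    features = []
    for transfer in gabor_bank(size, n_scales, n_orientations):
        response = np.abs(np.fft.ifft2(spectrum * transfer))
        # Mean response inside each of the grid x grid cells.
        pooled = response.reshape(grid, cell, grid, cell).mean(axis=(1, 3))
        features.append(pooled.ravel())
    return np.concatenate(features)

# Hypothetical input: a 32 x 32 grayscale image with random content.
image = np.random.default_rng(0).random((32, 32))
descriptor = gist_like(image)   # length n_scales * n_orientations * grid**2
```

With the default settings the descriptor has 2 × 4 × 16 = 128 entries regardless of image content, which reflects the compactness that makes global descriptors attractive. The fixed spatial grid is also why such descriptors are sensitive to viewpoint.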

CONVOLUTIONAL NEURAL NETWORKS
Recently, convolutional neural networks have achieved state-of-the-art performance on various classification and recognition tasks, e.g., handwritten digit recognition, object classification Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Christian et al. (2014), and scene recognition Zhou et al. (2014); Yuan et al. (2015). Features extracted from convolutional neural networks trained on very large datasets significantly outperform SIFT on a variety of vision tasks Krizhevsky et al. (2012); Fischer et al. (2014). The core idea behind CNNs is the ability to automatically learn high-level features from a significant amount of data through deep stacked layers in an end-to-end manner. A CNN works as a function f(·) that takes inputs such as images and outputs image representations in the form of vectors. A common CNN model for fine-tuning is VGG-16 Simonyan and Zisserman (2014), whose architecture is shown in Figure 3. For an intuitive understanding of what a CNN model learns in each layer, see the heatmap visualizations in Sünderhauf et al. (2015); Arandjelović et al. (2016).

Chen et al. (2014) were the first to exploit a CNN as a feature extractor in a place recognition system. They used a pre-trained CNN called Overfeat Sermanet et al. (2013), originally proposed for the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and showed that the advantages of deep learning carry over to the place recognition task. Fischer et al. (2014) investigated the performance of CNNs and SIFT on a descriptor matching benchmark. Sünderhauf et al. (2015) comprehensively evaluated and compared the utility and viewpoint-invariant properties of CNNs. They showed that features extracted from the middle layers of CNNs are robust against conditional changes, including illumination, seasonal, and weather changes, while features extracted from the top layers are more robust to viewpoint change.

Features learnt by CNNs have proved to be versatile and transferable: even though they were trained on a specific target task, they can be successfully deployed for other problems and often outperform traditional hand-engineered features Sharif Razavian et al. (2014). However, their use as black-box descriptors in place recognition has so far yielded limited improvements. For example, visual cues that are relevant for object classification may not benefit the place recognition task. Arandjelović et al. (2016) designed a new CNN architecture based on VGG-16 and the VLAD representation: they removed all the fully connected layers and plugged in a VLAD layer by making it differentiable.
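One way to see how VLAD can be made differentiable is to replace the hard nearest-word assignment with a soft assignment. The sketch below is an assumption-laden illustration, not the layer of Arandjelović et al. (2016): the softmax weights are derived directly from distances to fixed centroids with a hypothetical sharpness parameter `alpha`, whereas a trainable layer would learn its assignment parameters, and any per-word normalization is omitted.

```python
import numpy as np

def soft_assign_vlad(X, C, alpha=1.0):
    """VLAD with softmax (soft) assignment of features X (N x d) to words C (k x d)."""
    # Squared distances between features and centroids, shape N x k.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    logits = -alpha * d2
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(logits)
    A = A / A.sum(axis=1, keepdims=True)                  # each row sums to 1
    # Residuals of every feature with respect to every centroid, N x k x d.
    residuals = X[:, None, :] - C[None, :, :]
    # Weight residuals by the soft assignments and sum over the features.
    V = (A[:, :, None] * residuals).sum(axis=0)           # k x d
    v = V.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy example (hypothetical data): two visual words and three features.
C = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]])
v_soft = soft_assign_vlad(X, C, alpha=0.01)   # features contribute to both words
v_hard = soft_assign_vlad(X, C, alpha=50.0)   # approaches hard-assignment VLAD
```

Every operation here is smooth in `X` and `C`, so gradients can flow through the aggregation. As `alpha` grows, the soft assignment approaches the hard assignment of the original VLAD.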
The loss function used by Arandjelović et al. (2016) is a triplet loss, which is also used in many other recognition tasks. Zhou et al. (2014) gathered a huge scene-centric dataset called "Places" containing more than 7 million images from 476 scene categories. However, scene recognition is fundamentally different from place recognition. Chen et al. (2017) created the first large-scale place-centric dataset, called SPED, containing over 2.5 million images.
Although CNNs have the power to extract high-level features, we are far from making full use of them. How to gather a sufficient amount of data for place recognition, and how to train a CNN model end to end so that it automatically chooses the optimal features to represent an image, remain open problems.
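For reference, the generic form of a triplet loss mentioned above can be written in a few lines. This is a sketch of the standard margin-based formulation, not necessarily the exact variant used in the cited work; the margin value and the descriptor vectors are hypothetical.

```python
import numpy as np

def triplet_loss(query, positive, negative, margin=0.1):
    """Hinge loss that pushes the positive closer to the query than the negative."""
    d_pos = np.sum((query - positive) ** 2)   # squared distance to the same place
    d_neg = np.sum((query - negative) ** 2)   # squared distance to a different place
    return float(max(0.0, d_pos - d_neg + margin))

# Hypothetical 2-D descriptors for a query, a matching and a non-matching image.
q = np.array([0.0, 0.0])
p = np.array([0.0, 1.0])
n = np.array([3.0, 0.0])
loss_ok = triplet_loss(q, p, n)    # positive is already closer by more than the margin
loss_bad = triplet_loss(q, n, p)   # roles swapped: the constraint is violated
```

The loss is zero once the matching image is closer to the query than the non-matching one by at least the margin, so only violating triplets contribute to training.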

DISCUSSION AND FUTURE WORK
Place recognition has made great advances in the last few decades, e.g., an understanding of the principles by which animals recognize and remember places and the relationships between them from a neuroscience perspective Lowry et al. (2016), a new way of describing places using convolutional neural networks, and a number of datasets built specifically for places. However, we are still a long way from a robust long-term visual place recognition system that can be applied well to a variety of real-world scenarios. Hence, we highlight several promising avenues of ongoing and future research that are leading us closer to this ultimate outcome.
Place recognition is becoming a hot research field and benefits from ongoing work in related fields, especially the enormous successes that deep learning has achieved in computer vision, e.g., in image classification, object detection, and scene recognition. While features extracted from CNNs pre-trained on other vision tasks have been shown to be transferable and versatile, they have so far yielded unsatisfactory performance on the place recognition task. It is very likely that we have not yet fully exploited the potential of CNNs, and performance can be improved in two respects. First, a sufficient amount of place-centric data should be gathered, covering various environmental changes including illumination, weather, structure, season, and viewpoint. One alternative and reliable source is Google Street View Time Machine. A CNN trained on a small dataset usually performs poorly on other datasets, since place recognition is dataset dependent, and it needs to be retrained whenever it is applied to a new dataset. Since one of the advantages of CNNs is their ability to extract representative features from big data, more data should improve state-of-the-art performance. Second, a CNN architecture optimized for the place recognition task is needed. Real-world images from cameras are usually high resolution, whereas in many cases the original images must be downscaled; for example, the input size of VGG-16 is 224 × 224. Moreover, an architecture that is well suited to object detection may not fit the place recognition task, since their visual cues are different. Designing a good loss function is also essential to learning good features. Developments focused on these two problems will further improve the robustness and performance of existing deep-learning-based place recognition techniques.
Place recognition systems can also benefit from ongoing research in object detection, scene classification Yu et al. (2013), and scene recognition Yu et al. (2015a). Semantic context from the scene, interpreted as the gist of the scene, can help to partition the search space when comparing the similarity of image representations, which supports scalability and real-time processing in real-world applications. Note that different places may share a common semantic concept, which calls for a further, more precise feature mapping procedure. Objects such as pedestrians and trees should be ignored, while objects such as buildings and landmarks are important for long-term place recognition. Automatically identifying and suppressing features that confuse visual place recognition systems will also improve place recognition performance.
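The idea of using semantic context to partition the search space can be sketched as a two-stage retrieval: restrict the database to images that share the query's scene label, then rank the remaining candidates by cosine similarity. The function name, the scene labels, and the fallback to a full search are assumptions made for illustration only.

```python
import numpy as np

def retrieve(query_vec, query_scene, db_vecs, db_scenes, top_k=1):
    """Rank database images of the same scene category by cosine similarity."""
    db_scenes = np.asarray(db_scenes)
    candidates = np.flatnonzero(db_scenes == query_scene)
    if candidates.size == 0:
        # Assumed fallback: search the whole database if the label is unseen.
        candidates = np.arange(len(db_vecs))
    q = query_vec / np.linalg.norm(query_vec)
    D = db_vecs[candidates]
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    sims = D @ q
    order = np.argsort(-sims)[:top_k]
    return candidates[order], sims[order]

# Hypothetical database of three image descriptors with scene labels.
db_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
db_scenes = ["street", "street", "park"]
idx, sims = retrieve(np.array([0.9, 0.1]), "street", db_vecs, db_scenes)
```

Only the candidates in the matching partition are compared, so the cost of the similarity search scales with the partition size rather than with the full database.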

Figure 2: Frames extracted from the Nordland dataset Sünderhauf et al. (2013) that belong to the same place in spring, summer, fall, winter.

Figure 3: VGG-16 configuration (shown in columns). The depth of the configuration increases from the left (A) to the right (E) as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as "conv(receptive field size)-(number of channels)". The ReLU activation function is omitted for simplicity.