kRadar++: Coarse-to-Fine FMCW Scanning Radar Localisation

Simple Summary This paper presents a hierarchical approach to place recognition and pose refinement for Frequency-Modulated Continuous-Wave (FMCW) scanning radar localisation. Abstract This paper presents a novel two-stage system which integrates topological localisation candidates from a radar-only place recognition system with precise pose estimation using spectral landmark-based techniques. We prove that the—recently available—seminal radar place recognition (RPR) and scan matching sub-systems are complementary in a style reminiscent of the mapping and localisation systems underpinning visual teach-and-repeat (VTR) systems which have been exhibited robustly in the last decade. Offline experiments are conducted on the most extensive radar-focused urban autonomy dataset available to the community with performance comparing favourably with and even rivalling alternative state-of-the-art radar localisation systems. Specifically, we show the long-term durability of the approach and of the sensing technology itself to autonomous navigation. We suggest a range of sensible methods of tuning the system, all of which are suitable for online operation. For both tuning regimes, we achieve, over the course of a month of localisation trials against a single static map, high recalls at high precision, and much reduced variance in erroneous metric pose estimation. As such, this work is a necessary first step towards a radar teach-and-repeat (RTR) system and the enablement of autonomy across extreme changes in appearance or inclement conditions.


Introduction
The ability to localise in an already visited environment is paramount for robotic autonomy, since it enables long-term operation.A common solution to this operational paradigm is the creation of a map as a digital representation of the traversed environment by the autonomous vehicle (AV).Depending on the detail and the organisation of the information in the map, various representations have been proposed in the last decades, including: metric, topological, or hybrid topometric ones.However, meaningful robotic tasks require only locally accurate "metric" localisation in the form of a rigid-body pose with respect to a prior map [1].The AV will then be able to use feature matching during a live pass in the same locations to localise itself onto the map and inform its motion control.
Consider these feature-based techniques and the criticality of safety in these scenarios.In order for AVs to travel safely at higher speeds or operate in wide-open spaces where there is a dearth of distinct features, the highest level of robust sensing is required.Frequency-Modulated Continuous Wave (FMCW) radar satisfies these requirements, thriving in all environmental conditions (rain, snow, dust, fog, or direct sunlight), providing a 360 • view of the scene, and detecting targets at ranges of up to hundreds of metres with centimetre-scale precision.Indeed, there is a burgeoning interest in exploiting FMCW radar to enable robust mobile autonomy, including ego-motion estimation [2][3][4][5][6][7], localisation [7][8][9][10][11], Simultaneous Localisation and Mapping (SLAM) [12], and scene understanding [13][14][15].However, despite radar's promise to deliver such capabilities, the study of these tasks is only mature for cameras and Light Detection and Rangings (LiDARs), and relatively little attention has been paid to radar for the same application.Figure 1 shows an example of application, where a scan belonging to the current, live pass (blue) is used to localise onto the previously-created map, to which a good match (green) and bad match (red) belong.The system should not only be able to identify (despite opposing viewpoints) the right match (green) but also to infer the displacement (translational and rotational) between the latter and the current position.We define this approach as a hierarchical localisation pipeline in which a rotationally-invariant radar place recognition (RPR) submodule first serves as a coarse localiser to retrieve relevant reference frames and a pose refinement step (precise in both translation and rotation) is secondly applied to the query frame and the portion of the map corresponding to the retrieved frames to estimate the metric pose.Our two-step system relies on a radar place recognition (RPR) module and a downstream pose refinement module.Firstly, consider the RPR module.Given an online query radar scan (blue dot on map and blue-framed radar image), the aim is to retrieve a correct match (green), disregarding the incorrect, although similar, radar scan the map also represents (red) and despite the obvious rotational offset.Secondly, consider the pose refinement module.Here, the rotational offset should not be disregarded and must be recovered.The range-azimuth region marked as interesting between the blue and green scans (dashed lines) is separated (solid lines) by such a rotational offset (significant wrapping around the horizontal axis), as well as a small translational offset (some movement down the vertical axis).Here, the conflict between the requirement that the RPR module be robust to small rotational and translational viewpoint differences and the requirement that the pose refinement module be precise in the same considerations are shown in this paper to be compatible in a framework that ultimately leads to a system which is more precise than RPR alone and more expedient than pose refinement alone.This paper's contribution is the proposal of a solution for such problem and can be summarised as the consolidation of two recent lines of investigation in centimetre-scale motion estimation and large-scale place recognition using FMCW radar.We adopt the vocabulary of "large-scale" from localisation literature, such as Reference [16], which are evaluated on similarly extensive urban datasets up to tens of kilometres rather than the vocabulary of works, such as Reference [1], which are benchmarked on data taken over areas as extensive as hundreds of kilometres.We view this contribution as the enabling methodology for future systems which will rely on FMCW radar as the primary sensor for long-term autonomy-for example, in an anticipated RTR application.
The remainder of this work is structured as follows.Section 2 explores related literature.Section 3 provides the prerequisite detail for the works that we build upon and how they are consolidated in the proposed system in Section 4. Section 5 provides necessary details for implementation, evaluation, and our dataset, and Sections 6 and 7 showcase the system in that experimental setting.Section 8 compares the solution presented here against a comparable solution from the literature.Sections 9 and 10 discuss the significance of the contribution and suggest further avenues for investigation.
The preliminary detail on RPR and radar pose refinement covered in Section 3 is from our earlier publications [2,8], while the marriage of these two systems in Section 4 is the principal contribution of this paper, along with the entirety of the experimental programme (c.f.Section 5) and experimental evidence (c.f.Sections 6 and 7).

Related Work
This section places the contribution in the literature on radar-based odometry and localisation, experience-based navigation, and two-stage topological and metric localisation using onboard sensors.
The system we propose in Section 4 has its behaviour explored in Sections 6 and 7 and is compared in Section 8 against some of the solutions discussed here.

Vision-and LiDAR-Based Lifelong Navigation
In this work, we defer to the evidence provided in the recent work in Reference [2,6,11] in proving the radars perform favourably to cameras and LiDARs in motion estimation and localisation and instead focus our experimental programme on what the proposed system radar-only brings to mobile robotic navigation capabilities.
However, works, such as Reference [17][18][19], demonstrate the power of the teach-and-repeat (TR) operational paradigm in enabling autonomy at scale using stereoscopic and monocular vision, as well as LiDAR.From amongst several approaches, TR has distinguished itself as being an effective solution in a number of domains, from urban autonomous driving [20] to Arctic exploration [17].In this framework, a robot is driven along a desired path in a teach phase, during which it builds a topometric map [21,22] of the world by estimating its relative pose between consecutive keyframes.During a repeat run, localisation is performed by matching features between the live feed and those of the map, estimating the position of the robot relative to the route previously traversed.
The Experience-based Navigation (EBN) framework [20,23] for lifelong navigation builds on from TR systems by investigating the utility of experience to localisation capability over many repeat passes of a changing world.

Radar-Based Mapping and Localisation
Besides its superior range and despite its lower spatial resolution, Millimetre-Wave (MMW) radar often overcomes the shortcomings of laser, monocular, or stereo vision because it can operate successfully in dust, fog, blizzards, and poorly lit scenes [24].In Reference [25], it is shown in the context of a SLAM system that while producing slightly less accurate maps than LiDARs, radars are capable of capturing details, such as corners and small walls.Some mapping and localisation systems using similar-but not identical-radars as that used in this work are described in Reference [26,27].
The supervised learning framework presented in Reference [8,9] and which our work extends uses rotationally-invariant feature extraction and triplet-mining but does not solve for the rigid-body pose of the sensor.The cross-modal radar-satellite works presented in Reference [10,28] do solve for the metric pose.The work presented in this paper, in contrast, does not rely on external services (i.e., satellite imagery) and is presented as a radar-only solution.
In Reference [7], a method is presented for learning to predict robust keypoints for odometry estimation and metric localisation in radar.Our work is distinct in that the topological localisation aspects are trained for precisely that task-not being trained for odometry estimation and then used for localisation (having no specific loss for localisation).Additionally, our metric pose refinement is not differentiable and thus not learned end-to-end.
FMCW dopper-enabled radars have been shown in Reference [29] to be beneficial.In this work, however, we consider range-only radar.

Hierarchical Localisation
Often underpinning the software stack for these systems is a hierarchical localisation pipeline in which visual place recognition (VPR) first serves as a coarse localiser to retrieve relevant reference frames and a pose refinement step is secondly applied to the query frame and the portion of the map corresponding to the retrieved frames to estimate the metric pose.Recent examples in the visual domain include Reference [30][31][32][33].
The work in Reference [34] is related to ours in the sense that it can also be seen as a hierarchical approach to localisation, in the particular case of LiDAR.We see two major distinctions between Reference [34] and our work.Firstly, to achieve rotational invariance we account for the cylindrical nature of the scan formation process in our architecture, while Reference [34] maintains the rotation information and exploits it to extract the yaw displacement from the comparison of the embeddings of two scans.Secondly, in Reference [34], the measure of overlap for a pair of scans is a nonlinear, complex function that takes both embeddings as input rather than a simple distance measure.As such, the query procedure prevents the direct use of efficient searching procedures, such as the k-D-trees we use in our RPR module.
Similarly, we view the transplant of the LiDAR method developed in Reference [35] to the radar domain as in Reference [11] as a hierarchical process.In contrast to this approach, the work presented in this paper uses a learned embedding for RPR as opposed to a mean-reduced ring-key and recovers the full three degree-of-freedom (3DoF) pose of the vehicle as opposed to rotation-alignment alone.

Preliminaries
This section briefly describes, in Sections 3.1 and 3.2, the two enabling subsystems which are essential to the proposed integration of Section 4. Section 3.1 describes the place recognition submodule, which is indepently illustrated in Figure 2. Section 3.2 describes the pose estimation submodule, which is based on an odometry estimation pipeline.

Radar Place Recognition (RPR)
Recent advances in radar-only place recognition capabilities allow us to use the FMCW sensor in lieu of global methods with other sensors, such as Global Positioning System (GPS).Here, we summarise the most salient features of the methodology, shown in Figure 2; the interested reader is referred to Reference [8] for more detail, as well as the experimental evidence of the robustness of this RPR module.
Feature extraction: To learn filters and cluster centres which help distinguish polar radar images for place recognition we use NetVLAD [36] with VGG-16 [37] as a front-end feature extractor-both popularly applied to the place-recognition problem.Importantly, the original implementation of the feature extractor is altered to obtain invariance to the orientation of input radar scans, including: circular padding [38], anti-aliasing blurring [39], and azimuth-wise max-pooling.
Circular padding along the azimuth direction has been applied to each convolutional layer of the VGG architecture.This is done as the polar representation of the assembled Fast Fourier Transform (FFT) returns has no true image boundary along the azimuth axis.The operation brings equivariance to rotation to the feature extraction architecture along the boundary region of the radar scan.[8].The offline stages of the pipeline involve enforcing and discretising the metric space, while the online stages involve inference to represent the place the robot currently finds itself within in terms of the learned knowledge and querying the discretised space, in this case depicted using a Voronoi-like structure, which encodes the trajectory of the robot.Downsample operations lead to high-frequency components of the signal, in this case the radar image, to cause aliasing in the sampled one.For this reason, we apply a Gaussian blur before each downsampling operation in order to reduce this effect and help the natural equivariance of the convolutional operations.
Lastly, we applied a max-pooling operation along the azimuth dimension at the end of the feature extractor.Indeed, so far the architecture has only been equivariant to rotation, meaning that a rotation of the radar scan would correspond to a rotation on the output features.Applying a max-pooling operation, which is inherently invariant to the order of the inputs, would indeed bring this invariance to the architecture itself on the dimension of application, in this case the azimuth one.
Triplet mining: To enforce the metric space, we perform online triplet mining and apply the triplet loss described in Reference [40].
The triplet loss can be described using the Euclidean distance function where a, p, and n are an anchor radar scan, positive example from the same location, and negative example from a dissimilar location, respectively, and f represents the encoding done by the Convolutional Neural Network (CNN) to represent these scans as "embeddings".Examples of these are shown in Figure 1 (blue, green, and red, respectively).Loop closure labels are taken from a groundtruth dataset (c.f.Section 5).Batches are constructed such that there is no overlap of the radar sensing horizon between a candidate radar scan and any anchor scan already sampled for the batch.Online querying: Embedding distance is calculated as in Equation (1) above.For example, the embedding distance between a candidate in the map, c i and the query scan, q, is given by (2) In the deployed subsystem, whenever a new radar scan is available, it is encoded in the multidimensional space by performing inference upon the trained network.As will be discussed in Section 5.3, a nearest neighbour (NN) search retrieves the n closest neighbours, instead of a ball search out to a maximum embedding distance threshold, which has no guarantee on the number of candidates passed to the downstream pose refinement (c.f.Section 3.2).Therefore, the maximum number of queried neighbours is a design parameter and is set in Section 6 by taking into consideration the speed of the pose refinement algorithm (c.f.Section 3.2) which will be executed on pairs of radar scans, formed by the query scan and each of the retrieved neighbours.In Section 6, we show a range of recalls at high precision (c.f.Section 5.2) that are available when performing online queries which are strictly possible with reasonable compute requirements (c.f.Section 5.3).
Training details: Due to memory limitations on our graphical compute hardware, we crop the last 168 range bins and scale the width by a factor of 8 such that the original 400 × 3768 polar radar scans are input to the network with resolution 400 × 450.When finetuning either the original architecture or our proposed modified architecture, we initialise internal weights with the publicly available checkpoint vd16_pitts30k_conv5_3_vlad_preL2_intra_white, corresponding to the best performing model described in Reference [41], which produces embedding vectors of length 4096.As the azimuth axis remains unscaled, this does not affect rotational invariance.Our embedding triplet margin is set to 1.0 and our learning rate schedule applies a linear decay function initialised at 1 × 10 −4 and settling to 5 × 10 −6 at 5000 steps [42].We terminate learning at 500,000 steps in all cases.We use gradient clipping to limit the magnitude of the backpropagated gradients to 80 [43].An L2 vector norm is applied to regularise the weights with a scale of 1 × 10 −7 .We use two one-dimensional Gaussian blur kernels with size 7 and standard deviation of 1.

Pose Refinement
For geometric verification, we rely on the radar pose refinement pipeline described in Reference [2,3], which we summarise here.This algorithm reflects the understanding that real, correctly identified landmarks are the same distance apart in any two radar scans.

Landmark extraction:
The landmark extraction algorithm accepts power-range spectra as inputs and returns a set of landmarks, each specified by its range and azimuth.The core idea is to estimate the signal's noise statistics and then scale the power value at each range by the probability that it corresponds to a real detection.This results in two landmark sets, L 1 and L 2 , for the consecutive radar scans.Examples of these landmark sets are shown in Figure 3, where L 1 and L 2 are shown as the orange and blue sets extracted from (consecutive or non-consecutive) radar scans.These were taken from the train portion of the dataset trajectories discussed later in Section 5.
Motion correction: It is important for implementation purposes to note that the landmarks we use for localising a query scan to one of the candidate scans in the map are extracted during mapping.Therefore, we are able to apply a constant velocity model between motion estimates to reposition landmarks by accounting for motion distortion of an azimuth scan line.This is important considering the relatively low scan rate of a FMCW radar (4 Hz) as compared to the speed of the vehicle (typically 30 km h −1 to 50 km h −1 ).
Data association: The pose refinement algorithm achieves robust point correspondences using high-level information in the radar scan: intuitively, it seeks to find the largest subsets of two pointclouds that share a similar correspondence between pairs of landmarks.First, for every point in the first pointcloud, a potential point match in the second pointcloud is suggested based on a point descriptor that captures the point's distance and angular information.This results in a set of candidate landmark pairs B of size W.In Figure 3, the set B would be comprised of pairs of landmarks from the orange and blue landmark sets, where W would be smaller than or equal to the smaller of L 1 or L 2 . .In (a), the orange landmark set is highly compatible with the blue landmark set.This is because the the scans underlying these sets are taken from very proximal locations.This pair of sets corresponds to the ("bright") compatibility matrix illustrated in Figure 4b.In contrast, (b) shows the same landmark set (orange) alongside a highly incompatible landmark set from a farflung location (blue).This example corresponds to the ("dark") compatibility matrix illustrated in Figure 4c.
Spectral decomposition: Non-negative compatibility scores for each pair of these proposed matches are computed and assigned to the elements of a compatibility matrix, C of size W×W, in the form of: where d i and d j are the distances between a candidate pair i and j.Examples of these compatibility matrices are shown in Figure 4. We state this here to avoid confusion between the trajectory similarity matrices visualised later (e.g., the groundtruth SE(2) or embedding distance matrices discussed in Sections 5 and 6).
The vector m ∈ {0, 1} W is a solution of the data association problem, where m i = 1 for a landmark pair match B{i} that is considered or m i = 0 otherwise.The optimal solution, i.e., the solution that maximises the overall compatibility, can be computed as: As m is discrete, the above maximisation is computationally difficult.So, in line with Reference [44], we relax this constraint to find the continuously-valued u * : In short, the optimal set of matches maximises the overall compatibility; thus, we use the normalised principal eigenvector of the compatibility matrix.We then apply Singular-Value Decomposition (SVD) to retrieve the registration transformation between the two scans [45].

Hierarchical Radar Localisation
Figure 5 provides an illustration of the offline and online stages of the proposed system from a perspective of nodes in an experience map (c.f.Section 4.1 below).Both systems described in Section 3 are used in both stages (mapping and localisation), since they are essential in converting the live data flow into the formats used for comparison.The interaction of these two submodules is made more explicit in Figure 6.The mapping process entails a two-fold discretisation: one made in a topometric space through an experience map and one made in an embedding space through a kD-tree procedure for fast lookups.Each radar scan in the live data is used to estimate the robot's motion during both mapping and localisation, but only a portion of them is actually used for localisation.The scans are chosen to be retained through a decimation process, either based on traversed space or time delay.
In the remainder of this section, we will discuss these steps in more detail.

Mapping
To create a topometric map, our approach is to first decimate the radar scans s i ∈ S m , where S m is the incoming, online sensor stream used for the map creation; the result is a decimated set S m ⊆ S m , S m = ∆(S m ), where ∆(•) is the decimation operation.This step is completely optional and performed with the only purpose of limiting the map size.We approach this task as a combination of two fixed thresholds: the first one in space, δ m , and the second one in time, τ m .We then select a scan s i ∈ S m iff.D(s j , s i ) ≥ δ m and T(s j , s i ) ≥ τ m -where D(•) or T(•) are the distance operators in space and time, respectively, and s j ∈ S m , s j ≺ s i is the last selected radar scan.
The selected locations S m and their relative displacements, depicted as nodes and edges, respectively, are then processed to be embedded in the topometric map.The pose refinement subsystem is the main odometry source for the proposed experiments-although our method does not lose generality if we use an external odometry source.To be clear, this odometry source is not used to propagate localisation candidates in a Montecarlo Localisation (MCL) or particle-filter incremental fashion.Although we expect that this would bolster performance, this paper is concerned with the integration of the RPR and pose refinement modules, and we leave this to future work.
In this view, the map is composed of two separate digital representations: a so-called experience map M and a structured, multidimensional embedding space E .
Experience map: The experience map addresses the need of the robot to integrate the localisation system to any further planning pipeline and has the task of containing and make available the necessary information.This experience map is illustrated as shaded blue in Figure 6.In our case, the nodes in the experience map will be used down the line for pose refinement.As such, we decided to store both the pointcloud and the point descriptors that we need for pose refinement, sacrificing memory for efficiency and speed further ahead.

R, t
< l a t e x i t s h a 1 _ b a s e 6 4 = " r s 6 P J       A block diagram illustrating the interaction of the two main submodules in our system: the radar place recognition (RPR) submodule described in Section 3.1 and the pose refinement module described in Section 3.2.A first drive-by of the route, which may be autonomous or piloted manually, is converted into a map of radar scans and their associated RPR embeddings-vectors of size 4096.Likewise, incoming (live) query radar scans are encoded by the feature extraction Convolutional Neural Network (CNN).The query embedding is associated with some subset of the map which is "close" in embedding space.We have illustrated here one such "closest" map frame.This map candidate and the query frame are then passed through the pose refinement module-a process which involves landmark extraction, data association, and rigid-body motion estimation.Importantly, to trade off online and robustness requirements, the two operating thresholds required must be tuned appropriately, as we cover in Section 6.
Embedding space: Secondly, we use a representation of the map for fast retrieval of localisation candidates.To do so, we transform each radar scans into a multidimensional space E by means of the RPR network.This process is illustrated as the blue networks in Figure 6.A nice property of the multidimensional space is that the closest two points in terms of arithmetic difference, the closest in terms of visual similarity.This property lead us to the possibility to structure E into efficient structures for similarity-i.e., closeness-retrieval.We opted for a kD-tree structure due to the implementation simplicity and its deterministic nature, but other possible alternatives exist in literature [46].

Localisation
The mapping procedure is depicted in Figure 5b.Once the map has been created and the AV is traversing the environment again, the localisation procedure can take place.It is possible, as in Reference [47], to perform mapping and localisation in tandem and to create database content as the map is being used for localisation.We leave this for later work (c.f.Section 10).
For now, let us consider an online radar stream S l , shown left in Figure 6.As first step, we perform a decimation operation ∆(•) to filter the incoming data, resulting in the set S l .Similarly to the mapping procedure, we use a distance and a time thresholds δ l and τ l to limit the computational burden of the localisation process.
The localisation pipeline is further composed by three stages.First, the embedding space E is used to retrieve n closest matches in terms of visual similarity to the current radar scan, then each of them is tested for geometrical similarity and, if positive, the planar displacement between the pair of scans that best match is computed.
Visual Similarity: In this stage, the topological localisation network is used to retrieve from the embedding space candidates which are proximal in the multidimensional embedding space to the current radar scan.As mentioned before, we measure closeness with the euclidean distance || • || 2 and query the balanced kD-tree structure used as map for all the candidates within a certain range E. This is shown as the embedding distance threshold in Figure 6.E is set to limit the search to the nodes with high visual similarity to the online radar scan; nevertheless, since the computational power present on the robot is finite and the map can include many nodes that are visually similar, we can set a maximum number of nodes to be retrieved, N. N is set to match the temporal performances needed by the system, i.e., the maximum number of scans that can be processed in a useful time period by the final stage of the pipeline.
Geometrical Similarity: Once n ≤ N candidates are retrieved from the embedding-space map (c.f.Online querying in Section 3.1), a verification stage is carried out on each of them.To this end, we perform a geometric verification on the two pointclouds extracted from the online, query radar scan and each of the candidates, one at a time.The choice is to perform the geometric verification as second stage since it is orders of magnitude slower than the visual-similarity search.
Since the method proposed in Reference [2] does not include introspection for assessing the fitness of the relative pose solution and [5] requires training a model with supervision by an external sensor, we opt for a radar-only solution that takes advantage of the compatibility matrix C (c.f.Spectral decomposition in Section 3.2).Since C is computed starting by scan-dependent descriptors, to increase the speed of the pipeline, it is possible to cache in the map itself the subproducts of the procedure (cf.Experience map in Section 4.1).
In the design of a quality score s for a pose refinement solution, consider the compatibility matrix C of Equation ( 3).Each compatibility score (for a pair of matching landmarks between consecutive radar scans) is a member of the set (0, 1], where a higher value of compatibility score means a higher confidence that the pairs do actually match.In this scenario, a perfect compatibility matrix would contain 1 for every candidate pair (i, j).
In this light, we can design s of a match between two radar scans as the average value of the compatibility matrix C, i.e., where M is the dimension of the squared compatibility matrix C, defined in Equation (3) on page 7.
We subtract the trace of the matrix from the score since it does not contain any information about the geometrical compatibility of the points between the two scans.The quality score s can then be compared against a threshold Σ to define if the radar-scan pair is indeed a loop closure.This threshold is shown as the quality score threshold in Figure 6.
This quality score can be interpreted as the "brightness" of the compatibility matrix visualisations shown in Figure 4.The corresponding thresholded quantity in the "ScanContext" hierarchy of Reference [11,35] is computed by finding the most likely rotational alignment between radar scans using cosine-similarities.In contrast to rotation alignment alone, our method exploits a metric on the full 3DoF relative pose of the scans, which is itself a byproduct that "ScanContext" does not provide.Additionally, the "ScanContext" method relies directly on radar power returns, which are prone to sensor artefacts (e.g., multipath reflections), while our measure of quality is computed from the compatibility of landmarks which are themselves detected in a manner which is designed to be robust to these effects (c.f.Section 3.2).
Pose Computation: Lastly, once one or more candidates are acknowledged as loop closure-by comparing the value of s against the threshold Σ-the final step of the pose-computation pipeline can be carried out to estimate the displacement between the query scan and the candidates-resulting in the pose solution R, t shown top-right in Figure 6.

Experimental Design
This section describes the experimental setup which is the basis for the results to follow in Sections 6 and 7. Specifically, as this work deals with the novel integration (c.f.Section 4) of two (c.f.Section 3) seminal radar techniques, Section 6 spends some time describing the hyperparameter tuning suitable for effective use of the sensor for the application proposed.To this end, we lay out localisation performance and processing requirements in Sections 5.2 and 5.3.Section 5.1 begins by describing the dataset required for this optimisation, as well as the training and testing of the system as in Section 7.

Dataset
The experiments are performed using data collected from the Oxford RobotCar platform [48].The vehicle, as described in the recently released Oxford Radar RobotCar Dataset [49], is fitted with a CTS350-X Navtech FMCW scanning radar.
Groundtruth Location: The groundtruth database is curated offline to capture the sets of nodes that are at a maximum distance (15 m, see Section 5.2 below for more detail) from a query frame.This is peformed using the same data structures and operations as the discussion around Equation (2) which deals with online querying of the RPR embedding space-but this time in a metric space measured by an inertial system-as the initial phase of our proposed hierarchical localisation system.This has the added benefit of giving us ground truth rigid-body poses, as well as the topological matches, required to train our CNN.Through this, we create a graph-structured database that yields triplets of nodes for training the representation described in Reference [8] and summarised in Section 3.1.
To this end, we adjust the accompanying groundtruth odometry described in Reference [49] in order to build a database of groundtruth locations.We manually selected a moment during which the vehicle was stationary at a common point and trimmed each ground trace accordingly.We also aligned the ground traces by introducing a modest rotational offset.
Trajectory Demarcation: As shown in Figure 7a, each approximately 9 km trajectory in the Oxford city centre was divided into three distinct portions: train, valid, and test.
The network is trained with groundtruth topological matches between two reserved trajectories in the train split -2019-01-10-11-46-21-radar-oxford-10k and 2019-01-10-14-50-05-radar-oxford-10k from ori.ox.ac.uk/datasets/radar-robotcar-dataset/datasets.The valid split selected was quite simple, consisting of two straight periods of driving separated by a right turn.This split is simply used to monitor the losses as the RPR module is learning.
The test split (c.f. Figure 7b), upon which the results presented in Sections 6 to 8 are based, was specifically selected to feature vehicle traversals over portions of the route in the opposite direction; data from this split are not seen by the network during training.To be clear, no part of this test split-including the portion where streets, lanes, and alleys are driven in reverse-has any overlap whatsoever with the train or valid split.This is emphasised in Figure 7a.A and B are straight portions which look very similar, while C is a very uniform-looking roundabout; these properties are represented in the embedding distance (c.f. Figure 8).Moreover, B is driven twice, in the two opposite directions.The true positive boundary (c.f.Section 5.2) is shown green in (c).This 25 m threshold is deemed suitable even in tight roundabouts, such as this, where a larger boundary (blue) would make the location of the vehicle in this tight turn ambiguous.
Portion B of the test set (as shown in Figure 7b) is traversed twice, in opposite directions, as depicted by the off-diagonal positives in Figure 8a.The localisation pipeline can detect the backward traversals, as can be noticed by the low values in embedding distance-Figure 8b-and high quality score-Figure 8c.Furthermore, it is interesting to notice the squared-looking blue portions of the embedding distance: these are portions in the map which are not geometrically contiguous, but are similar nevertheless in embedding space; in particular they are the straight portions of the test set-A and B in Figure 7b-and different sections of the roundabout-C.We additionally observe that the same portions of the map are not as close in quality score, meaning that, even if they look alike, the geometries of the scans do not match.
Foray Reservation: We use in total 30 forays from the Oxford Radar RobotCar Dataset: 2 forays for hyperparameter tuning, • 1 foray for mapping, and 25 forays for localisation.
The results focus on a TR scenario, in which all remaining trajectories in the dataset are localised against a map built from the test set of the first trajectory -2019-01-10-12-32-52-radar-oxford-10k from ori.ox.ac.uk/datasets/radar-robotcar-dataset/datasets -that we did not use for learning or hyperparameter tuning, totalling 25 trajectory pairs (and 26 km of driving) with the same map but a different localisation run.

Localisation Performance Requirements
In the groundtruth SE(2) database, all locations within a 25 m radius of a groundtruth location are considered true positives, whereas those outside are considered true negatives.We consider-due to the long radar sensing horizon on the order of 165 m-that putative matches out to 25 m are well within the capabilities of a downstream radar-only relative pose estimation system as described in Section 3.2.This is further illustrated and discussed in Figure 7c.
As the localisation solution is designed to be used in a software autonomy stack, we consider that false positives are potentially disastrous for motion control and pay heed to recall at 100% precision as the optimal point on the precision-recall curves which follow.
For positional errors, we attribute the bias in error (i.e.,10 m) to the manual alignment (c.f.Groundtruth Location in Section 5.1) and focus our discussion instead on either the stability or variance of these errors.

Online Requirements
While δ m and τ m (c.f. the mapping phase of our proposed system in Section 4.1) do not influence the online performances of the pipeline-if not in a larger or smaller search pool for candidates-the maximim number of topological candidates N and the localisation-stream decimation parameters δ l and τ l (c.f. the localisation phase of our proposed system in Section 4.2) are crucial design parameters.This is due to the desire that the pipeline must have a processing rate which is of the same order of magnitude of the live radar stream.Indeed, although the intermediate results needed by the procedure are calculated during the odometry estimation-e.g., pointclouds and point descriptors-the pose refinement pipeline works at about 7 Hz, while the Navtech radar sensor in use has a sampling frequency of 4 Hz.
We dictate as an online requirement that localisation solutions should be reported with δ l no more than 15 m of linear vehicle motion.This distance serves as it is smaller than the true positive boundary (c.f.Section 5.2).Thus, given a urban vehicle speed limit of 50 km h −1 , the resulting localisation rate is 0.93 Hz (or a delay of 1.08 s-which we approximate to τ l = 1 s-between reported localisation solutions).Under these assumptions, we feed the pose refinement subsystem with the best N = 5 candidates from the topological localisation subsystem to address for the kD-tree lookup time.For simplicity, we set δ m = δ l and τ m = τ l without loss of generality.
Once the five candidates have been geometrically tested, only the candidate which achieved the highest quality score will be used for localisation.For each matrix visualised here, the vertical and horizontal axes are in units measuring the number of frames in the query and reference trajectories, respectively.Note that in (b) frames which are proximal in embedding space are shaded dark, while the inverse is true for (c), where frames with high quality-scores are shaded lightly.It is especially important to note that the quality-score matrix in (c) is not computed in its entirety during online operation (prohibitively computationally expensive) but is shown in full here for illustrative purposes.
Figure 9 shows the resultant families of precision and recall (PR) curves.
When not including verification by the pose refinement module (c.f.Section 3.2), these PR curves give us a sense for how good the embedding metric space illustrated in Figure 2 learned by the radar place recognition CNN (c.f.Section 3.1) is.
Figure 9a represents the curves where all the queried candidates-with embedding distance lower than the set threshold E-are tested for their quality score; Figure 9b, instead, assumes only the five candidates with the lowest embedding distance-also lower than E-are tested for scan-matching.We see no relevant difference between the two families, except that the use of only candidates leads to a lower maximum recall, since many of the candidates are discarded.
Based on a design principle in which we require 100% precision (c.f.Section 5.2) and that online requirements limit the number of scan-match calls to 5 (c.f.Section 5.3) we chose the minimum embedding distance and maximum quality score that lead to the operating point-E = 1.0 and Σ = 0.421, respectively, which are used in the following experiments.With these parameter values, the improvement in recall at 100% precision with geometric verification is on the order of 15%.
So far, we evaluated each candidate singularly, i.e., in the count of positive and negative matches, each node in the localised and mapped trajectory is taken in consideration as separate entity.In a real localisation task, though, we consider a successful localisation if a queried frame from the localisation trajectory has a match in the map.For this reason, Figure 9c shows the PR curves obtained by using only the best quality-score match between the 5 selected one.we can see how our choices for E and Σ in this context are sub-optimal for the task, since it is possible to achieve higher recall either by rising E or lowering Σ.In short, we decided not to rise E or lower Σ as a further layer of robustness so that all the 5 localisation candidates would be actual loop closures.We compare our system (from now on referred as kRadar+SM) against two different baselines.First, we use topological matches alone with the same E threshold (referred as kRadar@1); this comparison will expose the benefit of the pose refinement pipeline as employed in the second stage of the pipeline.Moreover, for more fair comparison, we use the best match in embedding distance with the highest threshold E which leads to 100% precision.The analysis in Figure 9c leads to a value of E = 0.36; therefore we refer to this baseline as kRadar@0.36.This comparison will show how the usage of five candidates will greatly increase the recall of the system.Indeed, while the precision remains at 100%, the recall drops from about 75% to about 10%-as in Figure 9c.Families of precision and recall (PR) curves for hyperparameters tuning as described in Section 6.These curves represent three different strategies for querying first the visual similarity and then the geometric similarity as measured by Euclidean distance and quality score in the embedding space and pose refinement stages of our pipeline.In (a), the legend values show embedding distance thresholds, E, below which all candidates for a match to the query frame are measured for their quality score by pose refinement.The parameter which is varied over each curve is the quality score threshold to take as a "correct" match, Σ, from the most (top) to the least (bottom) strict value."Original" refers to the results reported in Reference [8] with no pose refinement (only RPR), i.e., as Σ = 0.However, as described in Section 5.3, for online operation, we require that only 5 candidates be passed downstream to the pose refinement submodule; therefore, in (b), the 5 closest candidates (also within an embedding distance threshold of E) are used to compute quality scores which must be higher than Σ to be considered a match.We select the lowest embedding distance that achieves highest recall at 100% precision, i.e., E = 1.0, at a quality score threshold of Σ = 0.421; this is based on tuning the system to the "knee" of the PR curve, where we achieve 100% precision.While before we were considering each candidate singularly, in (c) we show the performances of the system considering a successful localisation if the best single candidate-i.e., with the best quality score match from amongst the 5 candidates closest within embedding space-is a positive match.Here, we can achieve much higher recall at 100% precision, but at the cost of sacrificing correctness for all 5 candidates-which may be useful in a multi-particle filtering localisation scheme, for example.

Localisation Performance
In this final results section, we explore an RTR scenario by showing aggregate and metric localisation performance when localising a single experience over a map (c.f.Section 7.1) and then localising the remaining experiences in the month-long dataset against that same map (c.f.Section 7.2).

Localisation of a Single Experience
In this experiment, we evaluate the accuracy of the reported poses over a repeat pass on the one traversal reserved as map -2019-01-10-12-32-52-radar-oxford-10k from ori.ox.ac.uk/datasets/radarrobotcar-dataset/datasets.
In this experiment, our system achieves 100% precision and 19.7% recall, being able to correct the faulty cases of kRadar@1 which achieves 90% precision and 97.3% recall.From Figure 10a, which shows the qualitative results of projecting the localisation results onto the ground-truth map trajectory, one can notice how most of the errors includes a mismatch between portions A and B, or within C (c.f. Figure 7b).Please note that the trajectories in Figure 10a are offset for visualisation purposes only and that these offsets are not applied when computing the metric errors reported in Figure 10b, statistics of metric errors reported in Figure 11, or precision-recall performance reported in Figure 9.
Even if tuned for 100% precision, kRadar@0.36does not achieve it, remaining at 92.3% precision and only 19.7% recall.This result shows that topological localisation solutions alone are hardly enough for a complete system and geometric consistency is an effective way to correct for faulty candidates.
Figure 10b shows the positional deviation of the localisation solutions against the ground-truth position of the localised traversal.This is a further confirm that geometrical compatibility can lead to more robust systems, discarding bad matches even if their embedding distance is lower than the threshold.

Month-Long Localisation
To better confirm the previous results, in this experiment we investigate the utility of a single experience to localisation outlook over many repeat passes.Figure 11 shows the statistics of the positional error as functions of trajectory number-or equivalently, time-using kRadar@0.36(c.f. Figure 11a), kRadar@1 (c.f. Figure 11b) and our proposed system (c.f. Figure 11c).Here, these performance metrics remain relatively constant throughout the month.We assert that no such guarantee is available in camera-or LiDAR-based TR systems (due to appearance change driven by condition) and attribute any noticeable variation to structural variation and scene dynamics (moving cars or people).Specifically, consider that the upper bound for the lateral error for the first localised trajectory (left-most) in Figure 11c is only on the order of 10 1 -minor errors limited to the same street that the car is truly present on-while, for Figure 11a,b, it is on the order of 10 2 to 10 3 -grievous errors on the scale of the extent of the dataset.This severe failure of performance is consistent across dataset logs, or localised trajectories, in Figure 11b, and not limited to this first experience as in Figure 11a.Figure 11c, our proposed system, maintains (maximum) errors on the order of 10 1 consistently across the month-long localisation effort.
As seen also in Figure 12a, in a more compact way, our system exhibits the highest performance in rejecting incorrect matches, thus limiting the positional error.While also the lower threshold in kRadar@0.36increases the ability in rejecting the bad matches, it never reaches the performances of the complete system.Moreover, Figure 12b,c show how kRadar@0.36increases precision to the detriment of recall, while kRadar+SM achieved 100% precision-except a single case with 98.96%-with far less effect on recall.Statistics of the positional error as the localised trajectory is varied over the course of a month and the map is held constant.(a) represents kRadar@0.36(b) kRadar@1 and (c) kRadar+SM.As described in Section 5.2, there is a bias in these errors (of approximately 10 m) attributed to the manual alignment of the groundtruth odometry traces.Despite this, our proposed system is shown in Figure 11c to exhibit tighter variance over the course of this experiment.Please note the logarithmic scale., and recall (c) for the proposed system and the two considered baselines.As described in Section 5.2, there is a bias in these errors (of approximately 10 m) attributed to the manual alignment of the groundtruth odometry traces.Despite this, our proposed system is shown to exhibit tighter variance over the course of this experiment.

Benchmark Comparison
In this section, we briefly examine how the contribution compares against other solutions within reach from the broader community.
The solution with the most readily available public implementation is "ScanContext" [35].This has been adapted to radar in Reference [11].However, only a C++ implementation for LiDAR is publicly available.Therefore, we have used in generating the results presented in this section our own implementation.Specifically, we use a ring key length of 120 and downsample the radar bins to 40-in the same way as presented in Reference [11,35].We show, in Figure 13, a family of pr-curves, each of which uses a different number of NNs, starting from 5 and going up to 50-with the later being used to report the results in Reference [11].These curves are generated for the test portion of the Oxford Radar RobotCar Dataset, as for the results discussed in Section 6 above.Considering that the maps (c.f.Section 4.1) for the test portion typically contain 150 keyframes (spaced out by vehicle motion), the key takeaway from Figure 13 is that, no matter how many NNs we pass from the place recognition stage to the pose refinement stage of the method presented in Reference [11], high recalls at high precisions are not achieved in the fashion we show for our proposed system in Figure 9c.Consider additionally that the method presented in Reference [11] also only recovers pose up to rotation, rather than the full SE(2) relative pose of the vehicle with respect to the map as is available in our method.
Please also note that the precision-recall performance reported for the "ScanContext" method as applied to radar here is worse than that reported in Reference [11].We attribute this to our definition of true and false positives and negatives.In Reference [11], a query's localisation solution is considered correct if the top retrieved candidate is within a certain distance of the actual location of the query.In our work, however, all candidates within that certain distance are considered bonafide matches that must be retrieved.

Conclusions
We have presented an application of recent advances in learning representations for imagery obtained by radar scan formation for place recognition and seminal work in radar pose estimation to an autonomy-enabling hierarchical localisation process.We demonstrated the efficacy of the proposed system on the largest radar-focused urban autonomy dataset collected to date and showed that the proposed system: 1. can be used to bolster the performance of topological localisation by geometric verification, 2. reports accurate poses in a TR scenario, 3. maintains localisation performance over long time scales, and 4. lends itself well to lifelong navigation techniques for improving localisation.Specifically, we suggest a range of sensible methods for tuning the system which are suitable for online operation and achieve over the course of a month of localisation trials against a single static map high recalls at high precision and much reduced variance in erroneous metric pose estimation.Depending on the nature of this tuning, we can achieve 100% precision at 85% recall with the single, best match (see Figure 12b,c).This performance is shown to compare favourably against another hierarchical radar localisation process, namely "ScanContext".
While closed-loop motion control experiments were out of the scope of this paper, we expect that the findings reported here are a strong motivator for exploiting this sensor technology for large-scale mobile robotic navigation in the future, specifically radar teach-and-repeat modes of autonomy.

Future Work
In the future, we plan to retrain, retune, and test the system on the all-weather platform described in Reference [50] and in off-road scenarios as planned in Reference [51].We plan to explore experimental settings which place stress on modern techniques for teach-and-repeat (TR) autonomy using cameras, LiDARs and radars.However, even though the place recognition stage is largely sensor agnostic, considering that this work is radar-specific in the pose refinement stage of its hierarchy, this future experimental programme will revolve around adverse conditions themselves.
We expect that the significance of this work will be persistent and anticipate follow-on investigation of efficient methods for selecting which experience to trust for localisation in a multi-experience TR scenario as in Reference [52], as well as predictive systems for characterising localisation performance in TR envelopes around the taught path as in Reference [53].
Finally, we are interested in the density of experiences in a selective recording of memory and expect (due to the immunity of radar to weather and illumination) that fewer memories would be required to represent places than in comparable vision-based systems [47].

Figure 1 .
Figure 1.Our two-step system relies on a radar place recognition (RPR) module and a downstream pose refinement module.Firstly, consider the RPR module.Given an online query radar scan (blue dot on map and blue-framed radar image), the aim is to retrieve a correct match (green), disregarding the incorrect, although similar, radar scan the map also represents (red) and despite the obvious rotational offset.Secondly, consider the pose refinement module.Here, the rotational offset should not be disregarded and must be recovered.The range-azimuth region marked as interesting between the blue and green scans (dashed lines) is separated (solid lines) by such a rotational offset (significant wrapping around the horizontal axis), as well as a small translational offset (some movement down the vertical axis).Here, the conflict between the requirement that the RPR module be robust to small rotational and translational viewpoint differences and the requirement that the pose refinement module be precise in the same considerations are shown in this paper to be compatible in a framework that ultimately leads to a system which is more precise than RPR alone and more expedient than pose refinement alone.

Figure 2 .
Figure2.The Frequency-Modulated Continuous Wave (FMCW) radar place recognition (RPR) pipeline, from Reference[8].The offline stages of the pipeline involve enforcing and discretising the metric space, while the online stages involve inference to represent the place the robot currently finds itself within in terms of the learned knowledge and querying the discretised space, in this case depicted using a Voronoi-like structure, which encodes the trajectory of the robot.

Figure 3 .
Figure 3. Example landmark sets after detection on radar scans as described in Section 3.2 where in Figure 3a |L 1 | = 1029 (orange) and |L 2 | = 1074 (blue).In (a), the orange landmark set is highly compatible with the blue landmark set.This is because the the scans underlying these sets are taken from very proximal locations.This pair of sets corresponds to the ("bright") compatibility matrix illustrated in Figure4b.In contrast, (b) shows the same landmark set (orange) alongside a highly incompatible landmark set from a farflung location (blue).This example corresponds to the ("dark") compatibility matrix illustrated in Figure4c.

Figure 4 .
Figure 4. Examples of pairwise compatibility matrices, C, which as described in more detail in Section 3.2 are used to solve for the SE(2) motion of the sensor.(a) A trivial example arising from the same radar scan passed to the pose refinement module.(b) An example using two radar scans in close proximity but with some offset.(c) An example where two radar scans are taken from distant locations in the world.From these, it is clear that the "brightness" of the (visualised) compatibility matrix can be used as a quality-score measure for the metric localisation solution as described in Section 4. Importantly, the dimensions of these matrices are determined by the number of matching landmarks between the two input radar scans (after data association).Examples of these landmark sets are shown in Figure 3, where in Figure 4b C is of shape 1029 × 1029 corresponding to W = 1029 of the |L 1 | = 1029 orange and W = 1250 of the |L 2 | = 1074 blue landmarks in Figure 3a.

Figure 5 .
Figure 5.The system is composed of two main phases: mapping (a) and localising (b).The mapping process entails a two-fold discretisation: one made in a topometric space through an experience map and one made in an embedding space through a kD-tree procedure for fast lookups.Each radar scan in the live data is used to estimate the robot's motion during both mapping and localisation, but only a portion of them is actually used for localisation.The scans are chosen to be retained through a decimation process, either based on traversed space or time delay.
z 7 m o y t W v n M I f 2 B 9 / g C Z 5 5 J n < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " F y i V h o 3 u h Q U F M 3 + t O L 6 D + K H 2 Z 1 4 = " > A A A B 8 H i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B U 0 k K o s e C F 4 9 V 7 I e 0 o W y 2 k 3 b p Z h N 2 N 4 U S + i u 8 e F D E q z / H m / / G b Z u D t j 4 Y e L w 3 w 8 y 8 I B F c G 9 f 9 d g o b m 1 v b O 8 X d 0 t 7 + w e F R + f i k p e N U M W y y W 1 b 2 v V e o P e R x F O I N z u A Q P r q E O d 9 C A J j C I 4 B l e 4 c 1 R z o v z 7 n w s W w t O P n M K f + B 8 / g C o 3 Z B b < / l a t e x i t > live scan < l a t e x i t s h a 1 _ b a s e 6 4 = " w 7 6 r l 2 n 0 e R w F O 4 Q w u I I A r q M E t 1 K E B D A b w D K / w 5 k n v x X v 3 P h a t a 1 4 + c w J / 4 H 3 + A G L 2 j e g = < / l a t e x i t > map < l a t e x i t s h a 1 _ b a s e 6 4 = " 1 f / x 4 4 6 P KC Z 6 B X V P i u c b / A / i I / k = " > A A A B 9 H i c b V B N S w M x E J 2 t X 7 V + V T 1 6 C R b B U 9 k t i B 4 L X j x W s R / Q L iW b z b a h 2 e y a z B Z K 6 e / w 4 k E R r / 4 Y b / 4 b 0 3 Y P 2 v p g 4 O W 9 G T L z g l Q K g 6 7 7 7 R Q 2 N r e 2 d 4 q 7 p b 3 9 g 8 O j 8 v F J y y S Z Z r z JE p n o T k A N l 0 L x J g q U v J N q T u N A 8 n Y w u p 3 7 7 T H X R i T q E S c p 9 2 M 6 U C I S j K K V 7 C s l j K p Q h B R 5 v 1 x x q + 4 C Z J 1 4 O a l A j k a / / N U L E 5 b F X C G T 1 J i u 5 6 b o T 6 l G w S S f l X q Z 4 S l l I z r g X U s V j b n x p 4 u l Z + T C K i G J E m 1 L I V m o vy e m N D Z m E g e 2 M 6 Y 4 N K v e X P z P 6 2 Y Y 3 f h T o d I M u W L L j 6 J M E k z I P A E S C s 0 Z y o k l l G l h d y V s S D V l a H M q 2 R C 8 1 Z P X S a t W 9 a 6 q 7 n 2 t U n / I 4 y j C G Z z D J X h w D X W 4 g w Y 0 g c E T P M M r v D l j 5 8 V 5 d z 6 W r Q U n n z m F P 3 A + f w C b F p I J < / l a t e x i t > map candidate < l a t e x i t s h a 1 _ b a s e 6 4 = " T y N 8 G 5 0 L H f S N I a j l t g T N d a T r M i Q = " > A A A C C X i c b V D L S s N A F J 3 4 r P E V d e l m s A i u S l I Q X R b c u K x i H 9 C E M p l M 2 q H z C D M T M Z R u 3 f g r b l w o 4 t Y / c O f f O G 2 z 0 N Y D A 4 d z 7 r 1 z 7 4 k z R r X x / W 9 n Z X V t f W O z s u V u 7 + z u 7 X s H h 2 0 t c 4 V J C 0 s m V T d G m j A q S M t Q w 0 g 3 U w T x m J F O P L q a + p 1 7 o j S V 4 s 4 U G Y k 4 G g i a U o y M l f o e D I W k I i H C u F j y z K o x Z d Q U Y e h y Z B R 9 6 H t V v + b P A J d J U J I q K N H s e 1 9 h I n H O 7 U j M k N a 9 w M 9 M N E b K U M z I x A 1 z T T K E R 2 h A e p Y K x I m O x r N L J v D U K g l M p b J P G D h T f 3 e M E d e 6 4 L G t t O s N 9 a I 3 F f / z e r l J L 6 M x F V l u i M D z j 9 K c Q S P h N B a Y U E W w Y Y U l C C t q d 4 V 4 i B T C x o b n 2 h C C x Z O X S b t e C 8 5 r / k 2 9 2 r g t 4 6 i A Y 3 A C z k A A L k A D X I M m a A E M H s E z e A V v z p P z 4 r w 7 H / P S F a f s O Q J / 4 H z + A A K P m p M = < / l a t e x i t > compatibility matrix < l a t e x i t s h a 1 _ b a s e 6 4 = " V 3 2 o b E P w l 1 9 e J e 1 6 z b + s e X f 1 a u O + i K M E p 3 A G F + D D F T T g F p r Q A g I C n u E V 3 h z l v D j v z s e i d c 0 p Z k 7 g D 5 z P H x b F k K M = < / l a t e x i t > landmarks < l a t e x i t s h a 1 _ b a s e 6 4 = " H r 0 s q s p G H q z 7 8 X M p p v J u / R + 3 B Y c = " > A A A B 6 3 i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e w G R I 8 B L x 6 j m A c k S 5 i d z C Z D 5 r H M z A p h y S 9 4 8 a C I V 3 / I m 3 / j b L I H T S x o K K q 6 6 e 6 K E s 6 M 9 f 1 v r 7 S x u b W 9 U 9 6 t 7 w A S e 4 R X e P O G 9 e O / e x 7 K 1 5 B U z p / A H 3 u c P Q L e O a w = = < / l a t e x i t > pose < l a t e x i t s h a 1 _ b a s e 6 4 = " 4 N B H 4 M 0 a 4 x j C L M J a 1 E i d M P J I n s k r e X O e n B f n 3 f m Y l + a c R c 8 x + Q P n 8 w d 4 + p x q < / l a t e x i t > quality score threshold < l a t e x i t s h a 1 _ b a s e 6 4 = " 0 G i F m n Q 2 s p m N z + b a U W C 0 q J n O r r 0 = " > A A A C E 3 i c b V D L S s N A F J 3 4 r PE V d e k m W A R x U Z K C 6 L L g x m U V + 4 A m l M n k p h 0 6 m Q k z E 6 G E / o M b f 8 W N C 0 X c u n H n 3 z h t s 9 D W A w O H c + 5 j 7 o k y R p X 2 v G 9 r Z X V t f W O z s m V v 7 + z u 7 T s H h 2 0 l c k m g R Q Q T s h t h B Y x y a G m q G X Q z C T i N G H S i0 f X U 7 z y A V F T w e z 3 O I E z x g N O E E q y N 1 H f O A y 4 o j 4 F r G 9 I I 4 p j y Q R D Y s d m M O Q F D 9 V C C G g o W 9 5 2 q V / N m c J e J X 5 I q K t H s O 1 9 B L E i e m u G E Y a V 6 v p f p s M B S U 8 J g Y g e 5 g g y T E R 5 A z 1 C O U 1 B h M b t p 4 p 4 a J X Y T I c 3 j 2 p 2 p v z s K n C o 1 T i N T m W I 9 V I v e V P z P 6 + U 6 u Q o L y r N c A y f z R U n O X C 3 c a U B u T C U Q z c a G Y C K p + a t L h l h i o k 2 M t g n B X z x 5 m b T r N f + i 5 t 3 W q 4 2 7 M o 4 K O k Y n 6 A z 5 6 B I 1 0 A 1 q o h Y i 6 B E 9 o 1 f 0 Z j 1 Z L 9 a 7 9 T E v X b H K n i P 0 B 9 b n D y k C n m E = < / l a t e x i t > embedding distance threshold < l a t e x i t s h a 1 _ b a s e 6 4 = " b p

Figure 6 .
Figure6.A block diagram illustrating the interaction of the two main submodules in our system: the radar place recognition (RPR) submodule described in Section 3.1 and the pose refinement module described in Section 3.2.A first drive-by of the route, which may be autonomous or piloted manually, is converted into a map of radar scans and their associated RPR embeddings-vectors of size 4096.Likewise, incoming (live) query radar scans are encoded by the feature extraction Convolutional Neural Network (CNN).The query embedding is associated with some subset of the map which is "close" in embedding space.We have illustrated here one such "closest" map frame.This map candidate and the query frame are then passed through the pose refinement module-a process which involves landmark extraction, data association, and rigid-body motion estimation.Importantly, to trade off online and robustness requirements, the two operating thresholds required must be tuned appropriately, as we cover in Section 6.

Figure 7 .
Figure 7. (a) train (blue), valid (black), and test (magenta) splits of the full Oxford Radar RobotCar Dataset.Red portions serve as padding between splits to account for the long range of the radar.(b) Example of test split for one of the traversals.A and B are straight portions which look very similar, while C is a very uniform-looking roundabout; these properties are represented in the embedding distance (c.f.Figure8).Moreover, B is driven twice, in the two opposite directions.The true positive boundary (c.f.Section 5.2) is shown green in (c).This 25 m threshold is deemed suitable even in tight roundabouts, such as this, where a larger boundary (blue) would make the location of the vehicle in this tight turn ambiguous.

Figure 8 .
Figure 8. Graphical representation of the SE(2) distance with 25 m threshold applied (c.f.Section 5.2)to form a binary mask used as ground-truth (a), embedding distance (b) and quality score (c) for each pair of radar scans in two separate trials.For each matrix visualised here, the vertical and horizontal axes are in units measuring the number of frames in the query and reference trajectories, respectively.Note that in (b) frames which are proximal in embedding space are shaded dark, while the inverse is true for (c), where frames with high quality-scores are shaded lightly.It is especially important to note that the quality-score matrix in (c) is not computed in its entirety during online operation (prohibitively computationally expensive) but is shown in full here for illustrative purposes.

Figure 9 .
Figure 9. Families of precision and recall (PR) curves for hyperparameters tuning as described in Section 6.These curves represent three different strategies for querying first the visual similarity and then the geometric similarity as measured by Euclidean distance and quality score in the embedding space and pose refinement stages of our pipeline.In (a), the legend values show embedding distance thresholds, E, below which all candidates for a match to the query frame are measured for their quality score by pose refinement.The parameter which is varied over each curve is the quality score threshold to take as a "correct" match, Σ, from the most (top) to the least (bottom) strict value."Original" refers to the results reported in Reference[8] with no pose refinement (only RPR), i.e., as Σ = 0.However, as described in Section 5.3, for online operation, we require that only 5 candidates be passed downstream to the pose refinement submodule; therefore, in (b), the 5 closest candidates (also within an embedding distance threshold of E) are used to compute quality scores which must be higher than Σ to be considered a match.We select the lowest embedding distance that achieves highest recall at 100% precision, i.e., E = 1.0, at a quality score threshold of Σ = 0.421; this is based on tuning the system to the "knee" of the PR curve, where we achieve 100% precision.While before we were considering each candidate singularly, in (c) we show the performances of the system considering a successful localisation if the best single candidate-i.e., with the best quality score match from amongst the 5 candidates closest within embedding space-is a positive match.Here, we can achieve much higher recall at 100% precision, but at the cost of sacrificing correctness for all 5 candidates-which may be useful in a multi-particle filtering localisation scheme, for example.

Figure 10 .
Figure 10.(a) Qualitative view of the localisation process.The tracks have been offset for visualisation purposes.(b) Positional errors as the localised trajectory one single trial.kRadar@0.36has been labelled with crosses due to its low recall.Please notice the logarithmic scale.

Figure 11 .
Figure 11.Statistics of the positional error as the localised trajectory is varied over the course of a month and the map is held constant.(a) represents kRadar@0.36(b) kRadar@1 and (c) kRadar+SM.As described in Section 5.2, there is a bias in these errors (of approximately 10 m) attributed to the manual alignment of the groundtruth odometry traces.Despite this, our proposed system is shown in Figure11cto exhibit tighter variance over the course of this experiment.Please note the logarithmic scale.

Figure 12 .
Figure 12.Statistics for positional error (a), precision (b), and recall (c) for the proposed system and the two considered baselines.As described in Section 5.2, there is a bias in these errors (of approximately 10 m) attributed to the manual alignment of the groundtruth odometry traces.Despite this, our proposed system is shown to exhibit tighter variance over the course of this experiment.

Figure 13 .
Figure13.A family of pr curves for the method of Reference[11] as run on the dataset collected in central Oxford.