Estimating Road Segments Using Kernelized Averaging of GPS Trajectories

: A method called iTEKA, which stands for iterative time elastic kernel averaging, was successfully used for averaging time series. In this paper, we adapt it to GPS trajectories. The key contribution is a denoising procedure that includes an over-sampling scheme, the detection and removal of outlier trajectories, a kernelized time elastic averaging method, and a down-sampling as post-processing. The experiment carried out on benchmark datasets showed that the proposed procedure is effective and outperforms straightforward methods based on medoid or Euclidean averaging approaches.


Introduction
During the last decade, the development of navigational and geolocation systems and applications has experienced strong growth.For the associated services, such as human behavior analysis, traffic modelization and prediction, smart city information services, geo-localized and contextualized recommendation, etc., to be exploitable in urban areas, it is necessary to rely on an up-to-date cartography.However, maintenance of road and pedestrian network maps requires costly manual editing in time and money.This need has spawned a specific research theme around the development of automated extraction algorithms of road network maps from GPS trajectories Shi et al. (2009); Mariescu-Istodor andFränti (2017, 2018).Global Position Systems (GPS) trajectories are easily and cheaply collected using consumer embedded equipment, such as smart-phones.Unfortunately, they are in general noisy, mainly due to the sensitivity of the GPS tracking system that is used, but also due to the fluctuation, or loss, of the signal from the satellites.Consequently, in many places, such as in cities or mountain environments, due to reflections, magnetic interferences or even tropospheric conditions or sun activities, GPS trajectories can be erroneous (more than 10 m deviations) or characterized with missing data, making them unsuited for applications without dedicated preprocessing.
When a single trajectory is at hand, Kalman filtering Welch and Bishop (1995) is usually the classical approach used to clean up this kind of sequential data.When, instead of a single trajectory, a set of trajectories can be considered, such as the random realization of a similar path followed by a population of pedestrians or road vehicles, one can consider ensemble filtering approaches such as ensemble Kalman filtering Evensen and van Leeuwen (2000) or variant of particle filtering allowing to cope with historical data Panangadan and Talukder (2010).
Both Kalman and ensemble Kalman methods require jointly the estimation of a measurement model and a dynamical model.However, the inference of these models are difficult to estimate with accuracy, specifically when the noise is non-Gaussian, and, furthermore, the parameters of the models may change with time and space, from one segment to the other.
In this article, we address the problem of cleaning sets of pedestrian or vehicle GPS trajectories corresponding to a road segment without making any assumption on the noise or the nonlinear dynamics underlying the movement of the tracked object.The cleaning procedure that we present relies on an ensemble filtering algorithm for sets of trajectories mostly based on 1 the notion of centroid defined for a subset of time series.It involves five steps, as depicted in Figure 1: 1. an over-sampling of the trajectories such that they all share a higher sampling rate, namely they are described with the same higher number of samples (Section 3.1); Figure 1: Processing pipeline overview: The meta parameters ν (iTEKA averaging method) and γ (soft-DTW averaging method) are defined, respectively, in Equations ( 5) and ( 8).The T meta parameter is the length of the trajectories when they have been over-resampled during the preprocessing step and is defined in Section 3.1.The τ parameter is a threshold involved to decide whether a trajectory is an outlier and is defined in Section 3.2.
This overall cleaning procedure was used during our experimentation to test comparatively various methods from the state of the art.We evaluated: (i) four medoid methods based on the Euclidean distance, the well-known dynamic time warp (DTW) measure Velichko and Zagoruyko (1970); Sakoe and Chiba (1971), the soft version of DTW (soft-DTW, Cuturi and Blondel (2017)) or a kernelization version of DTW (K rdtw , Marteau and Gibet (2014)); and (ii) four centroid based methods, namely the Euclidean centroid Hausner (1965Hausner ( ,1998)), the DTW Barycenter Averaging (DBA) centroid according to the method proposed in Petitjean et al. (2014), the soft-DTW centroid developed by Cuturi and Blondel (2017) and the iterative Time Elastic Kernelized Averaging proposed (iTEKA) in Marteau (2019).
Section 2 introduces the main concepts that are behind the DTW time elastic measure and its kernalization.It also introduces the general purpose (iterative Time Elastic Kernel Averaging (iTEKA) procedure that has been specifically developed to average sets of time series.Section ?? addresses the averaging of sets of GPS trajectories.It details mostly the preprocesing and postprocessing steps as well as the trajectory outlier detection and removal stage.Section 4 presents an experimentation carried out in the context of the "Averaging GPS segments Competition" (https://cs.uef.fi/sipu/segments/)Fränti and Mariescu-Istodor (2019) proposed by University of Eastern Finland.A conclusion ends this study.
2 From Dynamic Time Warping to Time Elastic Kernels Averaging

Dynamic Time Warping
Dynamic Time Warping (DTW) was introduced in Velichko and Zagoruyko (1970); Sakoe and Chiba (1971) as a measure of similarity between time series.DTW similarity is the results of an optimal alignment path π * between a pair of time series (originally speech waves) while locally considering expansion or squeezing of the time line.An alignment path π of length |π| = m between two time series x and x' is defined as the sequence of m (max(|x|, |x |) ≤ m ≤ |x| + |x |) pairs of aligned time stamps: where (π 1 (k), π 2 (k)) means that x π1(k) and x π2(k) are aligned.π 1 and π 2 obey the boundary and monotonicity conditions as: The eligible alignments paths are classically represented in a |x| × |x | grid, as displayed in Figure 2. Let A be the set of all possible alignments between two time series.The DTW similarity measure between time series x and x' is thus defined as: where ϕ : R × R → R + is a local distance measure (usually the Euclidean norm is used) on the set of real numbers R.

Time Elastic Kernels
From the DTW formulation, several attempts have been made to build kernel measures more suitable for machine learning purpose, in particular in the context of support vector machine.Distance substituting kernels were first introduced  (2007), which is, in most practically encountered conditions, positive definite (K ga ), and takes the following form: where κ(., .)= exp(−γ • ||., .|| 2 ) is a local kernel and A is the set of all admissible alignment paths.Marteau and Gibet (2014) proposed a general procedure to construct positive definite time elastic kernels from general time elastic distances.In particular, K rdtw , based on the design of a global alignment positive definite kernel for each single alignment path as given in Equation (3), has been defined such as close to the DTW matching scheme.
where this time C is any symmetric (in the sense that, if C contains an alignment path π, C also contains the symmetric path of π) subset of the set of all admissible alignment paths A between two time series, and K π (x, x') is a positive definite kernel associated to the path π and defined as: with κ a local kernel on R d .Typically, we use: where ν is a meta parameter for this method.Finally, soft-DTW Cuturi and Blondel (2017), written as dtw γ , was proposed to introduced a fully differentiable formulation from DTW.The essential idea is to replace the "hard" minimum operator by a "soft" expression that takes the following form with γ ≥ 0 (which is a meta parameter for this method): It is easy to show that there exists a direct relation between the global alignment kernel K ga (., .)and the soft-DTW: dtw γ (., .)= −γ • log(K ga (., .)).
Soft-DTW is considered as a state-of-the-art method for averaging a set of time series.Hence, in our experimentation, we evaluated it as a baseline method that we plugged into Steps 2 and 4 of our processing pipeline.

Time Elastic Averaging of a Set of Time Series
The multiple alignments problem has been widely studied in bioinformatics Fasman and L. (1998).It is known to be a NP-complete problem Wang and Jiang (1994); Just and Just (1999).Due to the "hardness" of this problem, heuristics have been proposed to provide centroid estimates in a reasonable time.
Among others, an iterative heuristic approach was initially introduced by Hautamaki et al. (2008) and popularized by Petitjean et al. (2011) who introduced the DTW Barycenter Averaging (DBA) algorithm.The iterative procedure, which integrates three steps, is first initiated by selecting a reference time series r, usually the medoid of the set S of time series that is to be averaged.The best alignments for all the time series in S with r are evaluated during the second step.In the third step, the reference is updated by averaging all the samples that are aligned with the same sample of r.The two last steps are iterated until reaching a local minimum of the summation of the DTW distances between the time series in S and r.
The soft-DTW Cuturi and Blondel (2017), which is fully differentiable in all of its arguments, has also been used to evaluate a centroid estimate for a set S of time series.Basically, in this case, the direct optimization problem can be solved using a gradient descent approach: where m i is the length of time series y i , λ i is a normalized weight associated to y i ( i λ i = 1), and C is the centroid we are seeking for.dtw γ (., .) is constructed by replacing in dtw the min operator with a softmin operator that introduces the γ meta parameter: These two time elastic averaging approaches (DBA and soft-DTW) constitute the state of the art in the context of averaging a set of time series.

Kernelized Time Elastic Averaging of a Set of Time Series
The averaging algorithm that we used to average a set of GPS trajectories is based on a probabilistic interpretation of the kernel alignment matrix (Equation (3)), as derived in Marteau (2019).This method is based on the recursive editing distance kernel, named REDK, which instantiates as K rdtw when DTW is considered as the editing distance.
The principle behind this interpretation is as follows.If we consider a stochastic alignment automata that, given two time series x and x', provides alignment paths, π, according to a probability distribution P π ≈ K π , then the cell (i, j) of the kernel alignment matrix (Figure 3, left) corresponds to the sum of the probabilities of the paths that allow aligning the sub time series x 0:i and x' 0:j .The kernel alignment matrix can thus be understood as a forward probability matrix.
Similarly, if we consider the backward alignment process (Figure 3, right), the cell (i, j) corresponds to the sum of the probabilities of the paths that allow aligning backwardly the sub time series  The forward-backward alignment matrix allows for the estimation of the expectation of the samples of x' that are aligned with sample x t (given that x t is aligned) as well as the expectation of time of occurrence, t of the samples of x' that are aligned with x t as follows: The expectation equations (Equation ( 10)) are at the basis of the procedure for averaging a set Let r 0:|r|−1 be a reference time series (r 0:|r|−1 ) that can be initially setup as the medoid of set X.The centroid estimate of X is defined as the pair (C, T ) where C is a time series of length |r| and T is the sequence of time stamps associated to the samples of Equations ( 10) and ( 11) are at the basis of the iterative agglomerative algorithm, called iTEKA (iterative Time Elastic Kernel Averaging), that provides a refinement of the centroid estimation at each iteration until reaching a (local) optimum, as presented in Algorithm 1.This algorithm was used, among other state-of-the-art averaging algorithms such as Soft-DTW, in Steps 2 and 4 of the processing pipeline depicted in Figure 1.
Algorithm 1 Iterative Time Elastic Kernel Averaging (iTEKA) of a set of time series.
1: Let K be a time elastic kernel for time series satisfying a probabilistic interpretation Equation ( 9) 2: Let X be a set of time series of d dimensional samples 3: Let C 0 be an initial centroid estimate (e.g., the medoid of X) of length n 3 Averaging a Set of GPS Time Series

Preprocessing the GPS Trajectories
Given a set X of GPS trajectory corresponding all to the same street or road segment, the averaging procedure presented previously cannot be used directly for several reasons: • The street segment is not necessarily traveled in a single direction.
• The trajectories are traveled with variable speed, hence the trajectories are possibly not sampled with the same level of detail or uniformly.
The first preprocessing step (Step 1 in Figure 1) consists in realigning the trajectories such that they could be considered as being traveled in the same direction.If x 0:n ∈ X, we denote x the reversed trajectory, basically x = x n:0 .The kernel defined in Equation ( 3) is used to reorganize X.This is achieved by selecting (randomly) one reference trajectory, x ∈ X and for all x' ∈ X, x' = x, if K rdtw (x, x ) ≥ K rdtw (x, x'), x will remain unchanged, otherwise, x' will replace x within set X.
The second preprocessing step is to re-sample uniformly the trajectories in the set X so that all trajectories contain the same number of samples, T .This is done using a linear interpolation of the segments that compose the trajectory.
By the end of the preprocessing procedure, all trajectories within set X are supposed to be traveled in the same direction and contain the same number of samples, T .

Averaging and Outliers Removal
The averaging is obtained using either medoid or centroid approaches, which corresponds to Steps 2 and 4 in Figure 1.When iTEKA centroid (Algorithm 1) is selected, the medoid according to the K rdtw measure is used to initiate the reference time series C 0 .
Once the centroid C of set X is obtained, the mean µ C,X , and the variance, σ C,X , of Log(K rdtw (C, x)) measure are evaluated, as x samples the elements of X.
For iTEKA approach, when σ C,X > τ , the time series x such that are removed from the set X, as far as |X| ≥ 3. Here, τ is a threshold that we empirically set to 5. Basically, all trajectories that are "log-distant" of at least one standard deviation are removed if σ C,X > 5, and are kept otherwise.For all other methods, the Log(K rdtw ) is replaced by the similarity measure that is used instead.This corresponds to Step 3 of the procedure depicted in Figure 1.
Once the outliers have been removed from set X, if any, a second averaging procedure is then carried out on the new set X' initialized with the previous centroid estimation, C 0 = C (Step 4 in Figure 1).The final centroid estimation C associated to the initial set X of GPS trajectories is finally provided by the averaging procedure depicted in Figure 1. Figure 5 gives an example of this outlier removal procedure used during Step 3. Note that this procedure does not guarantee that the centroid estimate would be closer to the ground truth.Sometimes, once the outlier removal has been applied, the centroid estimate worsens the assessment measure.However, at least on the training data, it brought on average some assessment measure improvement.

Post-Processing of the Centroid Estimate
The final step (Step 5 of the processing pipeline presented in Figure 1) consists in downsampling the centroid estimate C such that it contains the average number of samples that characterize the initial set of trajectories.This is achieved using a polygonal curve approximation procedure, as described in Marteau and Ménier (2009).

Experimentation
The experimentation was carried out using the training dataset provided by the Averaging GPS segments competition setup Fränti and Mariescu-Istodor (2019) at University of Eastern Finland (https://cs.uef.fi/sipu/segments/).It consisted in estimating a cleaned GPS trajectory segment given a set of GPS trajectories corresponding to the same segment.Only training data corresponding to a set of road segments were delivered along with ground truth trajectory for each considered road segments.
It is important to note that the assessment measure used to evaluate the competing approaches was not explicitly provided.However, on the training data, the challenge site makes it possible to obtain the average value of the assessment measure obtained by a given method by submitting the set of solution trajectories produced by this method.We show below that this unknown assessment measure used to rank the competing methods is not strongly correlated to a RMSE measure between an estimated trajectory and the corresponding ground truth trajectory.The assessment measure (as we learned once the challenge was closed) is referred to as HC-SIM, a hierarchical version of the C-SIM measures described in Mariescu-Istodor and Fränti (2017).The C-SIM measure is based on the notion of grid which partition the 2D space in contiguous cells of 25 m 2 .To compare two trajectories, the Jaccard index was evaluated by performing the ratio of the common cells shared by the two trajectories with the union of the cells traversed by the two trajectories.To avoid the effect of the discretization of the grid, the trajectories were slightly dilated, which had the effect of enlarging a bit of the trajectories by adding some of the adjacent cells.The HC-SIM (H for hierarchical) measurement was derived from the C-SIM measure by varying the size of the cells and providing a weighted average as output.The details of this measure have not yet been published by the authors.However, as mentioned above, an evaluation program allowed producing the results presented below.The HC-SIM measure gives a percentage of similarity between two trajectories (hence, it varies in [0, 100%]).
All approaches shares the T meta parameter, which defines the size of the resample trajectories.In addition, K rdtw , iTEKA as well as the soft-DTW medoid and centroid methods require the set-up of two meta parameters, ν and γ, respectively, the bandwidth of the local kernel parameter (Equation ( 4)) and T , the length of the trajectories, once they have been resampled after the second preprocessing step.
For all methods, these meta parameters were varied for all training sets of trajectories simultaneously, such as maximizing the HC-SIM measure obtained by the centroid estimates.
According to Figure 6, the meta parameters ν and T are correlated for the iTEKA algorithm.On the training data, a simple grid search led to selecting ν = 6 and T = 200, which allowed reaching a HC-SIM of 68.5%.
Similarly, according to Figure 7, the meta parameters γ and T are correlated for the soft-DTW centroid.On the training data, a simple grid search led to selecting γ = 2 and T = 100, which allowed reaching a HC-SIM score of 67.39%.Figures 6 and 7 show that over-sampling has an important impact on the HC-SIM measure of the estimated centroids.
As stated above, the HC-SIM measure is not strongly correlated to the RMSE measure.The correlation between RMSE and HC-SIM measures when varying γ parameter for soft-DTW or ν for iTEKA is, respectively, −61% and +67%.This shows the difficulty of this challenge, since the ground truth provided on the training data does not directly help to train the models.Hence, selecting the minimum RMSE value would not lead to the best HC-SIM measure.
Table 1 synthesizes the results obtained by the eight tested methods.It turns out that centroid based methods are much more accurate than the medoid ones.The averaging scheme is thus quite important.One can notice that, on this benchmark characterized by small and simple trajectories, the Euclidean averaging performed quite well, reaching a 67.15% HC-SIM when outlier removal was considered.This is better than soft-DTW that obtained 66.93%.The best method on this benchmark is iTEKA that reached 67.63% HC-SIM with outlier removal.However, with the absence of an analytical knowledge of the HC-SIM measure that is used, we cannot provide confidence intervals or state whether these results are significant or not.
The final results of the challenge (http://cs.uef.fi/sipu/segments/results.html), as provided by the organizers, are given in Table 2.When no post-processing was used, iTEKA method ranks first (Method A), but, as shown in the last two columns of the table, the method is slow and induces a spurious number of points in the averaged trajectory that is provided.When reducing the number of points of the centroid trajectory, using the down-sampling postprocessing, the HC-SIM quality measure dropped, as shown for Method E that corresponds to the processing pipeline presented in Figure 1 when iTEKA was used.The slight differences in the results apparent in Tables 1 and 2 are probably due to a slight change in the HC-SIM measure that produces differences in the selection of the meta parameters for the submitted results.Finally, Figure 8 presents the elapsed time in a logarithmic scale when T increases (the length of the re-sampled trajectories) for the centroids approaches.The Euclidean centroid method is clearly the most efficient one, as expected, followed by the iTEKA method that is significantly faster than the soft-DTW centroid one.The least efficient method is clearly DBA.Indeed, although all the tested algorithms were run on the same architecture and operating system, the observed differences of processing efficiency may be due, at least partly, to difference in the implementation choices.The medoid-based methods are more costly since their dependence with the size of the set is quadratic, while it is linear for centroid-based methods.

Conclusions
We have described a procedure for cleaning noisy sets of GPS trajectories corresponding to road segments.This procedure includes, in the first stage, an oversampling of the trajectories, prior to the calculation of a centroid or the search for a medoid, which composes the second stage of the procedure.It also includes, as a third stage, the detection and suppression of potential outliers, which improves on average the HC-SIM measure for almost all centroid based methods.
A down-sampling finalizes the procedure to produce centroid/medoid estimates whose lengths match the average length of the input trajectories.
The experiment allowed comparomg time elastic and Euclidean averaging approaches with more straightforward medoid approaches.
Our experimentation showed that: (i) centroid based methods outperform medoid based methods; (ii) the outlier detection and removal step improves on average the HC-SIM of final centroid estimation, but not the HC-SIM of the medoid selection; and (iii) over-sampling seems to also be a valuable step.
With the limited training data, it cannot be guaranteed that the comparative results presented here are effectively significant.However, it clearly emerged from our experimentation that centroid-based approaches outperform medoid-based approaches.Furthermore, the algorithmic complexity is clearly in favor to centroid-based approaches, since it is linear with the size of the sets of trajectories that are processed, whereas it is quadratic for medoid-based approaches.
As a perspective, to improve the HC-SIM quality of the cleaned trajectory estimation, one should consider the optimization of the meta parameters (T and τ essentially, as ν or γ may be considered as a function of T ) on each segment of road (instead on the whole set of segments), according to its topology.In that line of improvement, one should try first to clusterize the GPS datasets according to the segment shapes, and then optimize for each cluster the meta parameters.
2. a first extraction/estimation of a medoid/centroid for a subset of GPS trajectories (Sections 2.4 and 3.2); 3. anomaly (outlier) detection and removal (Section 3.2); 4. a second extraction/estimation of a medoid/centroid for a subset of GPS trajectories (Sections 2.4 and 3.2); and 5. a final down-sampling to reduce the sampling precision of the trajectories down to the average sampling precision of the initial set of trajectories (Section 3.3).

Figure 2 :
Figure 2: Three possible alignments path (green, red, black) between time series x and x'.

Figure 3 :
Figure 3: The forward (left) and backward (right) alignment kernel matrices

Figure 4 :
Figure 4: The forward-backward alignment kernel matrix

4:C
Let T and T 0 be two sequences of time stamps of length n initialized with zero values 5: Let M eanK 0 = 0 and M eanK be two double values; = C 0 , T = T 0 , M eanK = M eanK 0 ; 8: Evaluate C 0 and T 0 according to Equation (11) //Average similarity between C 0 and elements of X M eanK ≤ M eanK 0 11: (C, T ) is the centroid estimation 12: Finally, uniformly re-sample C using the time stamps T An early version of iTEKA was first published on Arxiv site in 2015 Marteau (2015).

Figure 5 :
Figure 5: iTEKA centroid estimate (in red) before pruning (left) and after pruning (right).The blue lines represent trained data, i.e. measured GPS trajectories corresponding to a street segment.The x,y coordinates are latitude and longitude converted in UTM coordinates then normalized in [0; 1].

Table 1 :
Best HC-SIM (in %) obtained with or without outlier removal for the eight tested methods.The best obtained HC-SIM value are indicated in bold cases.

Table 2 :
Challenge results as produced by the organizers: ranking of the competing methods according to the HC-SIM measure (in %).The iTEKA method corresponds to methods A and E.