Clustering Moving Object Trajectories: Integration in CROSS-CPP Analytic Toolbox

: Mobile devices equipped with sensors are generating an amount of geo-spatial related data that, properly analyzed can be used for future applications. In particular, being able to establish similar trajectories is crucial to analyze events on common points in the trajectories. CROSS-CPP is a European project whose main aim is to provide tools to store data in a data market and to have a toolbox to analyze the data. As part of these analytic tools, a set of functionalities has been developed to cluster trajectories. Based on previous work on clustering algorithms we present in this paper a Quickbundels algorithm adaptation to trajectory clustering . Experiments using different distance measures show that Quickbundles outperforms spectral clustering, with the WS84 geodesic distance being the one that provides the best results.


Introduction
Nowadays, there is a growing number of devices connected to the internet, and an exponential rise is expected, along with the development of 5G communication standards; see [1]. The availability of data and the increase in computation capacity have boosted the proliferation of new applications. Geolocation services have attracted great interest due to the easy collection of geopositional data through all kind of devices. This type of data can proceed from vehicle services http://vis.cs.kent.edu/demo/list_of_trajectory_ datasets.pdf (accessed date 17 April 2021), electrical micro-mobility (bikes and scooters) https://www.kaggle.com/pronto/cycle-share-dataset/code, 17 April 2021, peoplecollected data https://archive.ics.uci.edu/ml/datasets/GPS+Trajectories (accessed date 17 April 2021), and public transport https://data-crtm.opendata.arcgis.com/ (accessed date 17 April 2021). These new services have realized the relevance of analyzing moving patterns of their customers as a key characteristic to improve their value proposition and competitivity. Obtaining knowledge from moving agents' data has applications such as bus routes optimization [2], prediction of collective behavior [3,4], customized trips, or fleet management in the mobility field [5,6]. Additionally, the maturity of the mentioned services is impacting mobility in cities. The study of citizens' movements patterns may push forward more customized transportation [7]. This is a part of the more global concept of smart transportation and smart cities [8].
The surge in data from moving objects faces several challenges. One of them is how to extract useful insights of the data. Clustering of trajectories refers to grouping them based on a similarity measure [9]. This is a task that serves as a exploratory step in the discovery of patterns in moving objects and enables one to synthesize the information by establishing a representative trajectory for each cluster. By doing so, it serves as a base from which to start a deeper analysis of the characteristics of the trajectories.
A second challenge is related to quality assurance and trust. Among the diversity and randomness typical of the data of moving objects, it is vital to establish the data that correspond to real behaviors and those that are due to errors in the sensors or systems. This task is referred to as outlier/anomaly detection. It is domain-specific and is also used for detection of fraudulent activities, since these activities usually fall away from expected common behavior.
Another challenge in the application of trajectory data analysis is the development of user-friendly tools that enable the development of services to extract knowledge from integrated data sources. An obstacle is the dispersion of libraries due to most of them being domain-specific. Moradi et al. [10] mentioned more than 20 different packages in R, mostly focused on animal trajectories. In their work, they propose a set of data structures, classes and methods to deal with generic trajectory data. Another option is the MovingObjects library [11] implemented in python, which is based in the Geopandas data structures. The mentioned libraries do not offer no-code options, and thus, they present a learning curve for non-tech users. The open-source TrajAnalytics software [12] offers a web-based visualization interface that simplifies the analysis process for these users. In this work, we present the CROSS-CPP projecthttps://www.cross-cpp.eu/, 17 April 2021; its main aim is the development of a system for the integration and analytics of data streams coming from high volume (mass) products with cyber physical features. Its purpose is to offer new cross-sectorial services and focus on the commercial confidentiality, privacy and ethical issues using a context-sensitive approach. In the framework of this project, a toolbox has been developed to provide the analysis services. It has been designed as a easy-to-use tool thanks to its user interface that enables to analyze moving objects' anonymized data. The toolbox includes algorithms for different machine learning tasks, on both stream and batch mode. In particular, we focus in this paper on the clustering of moving objects.
Quickbundles [13] is known to be a clustering algorithm for tractography segmentation that has shown good performance in the health domain. Tractografies, like moving object trajectories, may be defined as streamlines. Thus, we propose in this paper to adapt Quickbundles for trajectory clustering and to integrate it into CROSS-CPP analytic toolbox. The adaptation is not straightforward, and some challenges have to be faced: (i) Quickbundles assumes streamlines being defined by a settled number of points, and as a consequence, in this paper, we solve the problem to define different trajectories under the same number of points; (ii) for the distance among streamlines, the measure of the distances is adjusted for moving object trajectories.
The paper presents the results of the experiments carried out with two datasets: (i) Porto Kaggle dataset https://www.kaggle.com/crailtap/taxi-trajectory. (accessed date 17 April 2021) and (ii) CROSS-CPP dataset (Due to privacy constrains the dataset is not publicly available). We also show how the integration looks like in CROSS-CPP and how it can be accessed. Consequently, the main contributions of the paper are as follows: (i) the adaption of Quickbundles algorithm for mobile data, (ii) the analysis of its performance in real trajectories datasets and (iii) the integration into the CROSS-CPP environment.
The rest of the paper is organized as follows: Section 2, Related Work, reviews the approaches for clustering trajectories, the used distance measures and models of the Earth surface. Section 3, Preliminaries , include the theory related with the topic and in particular trajectory definition, distance measures and comparison of trajectories. Section 4 contains the datasets' description, the preprocessing of the trajectories data, the validation metrics and results. In Section 5, it is explained how a non-tech person can perform the end-toend process of harvesting the data and perform a clustering analysis in the CROSS-CPP platform. Thanks to a graphic user-interface, the learning-curve is lower for non-tech professionals. The last Section 6 discusses the results, analyzes the pros and cons of the proposed approach, highlights future work and extracts conclusions.

Related Work
As mentioned above, the proposed approach adapts the algorithm presented in [13] for clustering and performs anomaly detection on that basis. Consequently, in what follows, we review previous approaches for trajectory clustering and detection of anomalies.

Clustering of Trajectories
Trajectory clustering is a common practice in trajectory data mining [9,14]. It consists in grouping the trajectories based on a similarity measure. The nature of trajectories as multidimensional data enables different clustering approaches that may be domain-or application-specific.

Distances between Trajectories
An important aspect in the clustering of trajectories is the computing of the distances among them. The first point to be accounted for is that moving objects travel the Earth in most cases. Therefore, the distance metric may consider the curvature of its surface. This is the case for the great circle and ellipsoids geodesic distance. In other cases, the effect of the curvature of the Earth is negligible, so the most used metric is the euclidean, but there are others like the Hausdorff or the longest common subsequence [9].
Distances between two points on the Earth: in geometry, the ancient Greeks defined the geodesic as the shortest distance between two points on a surface. There are several formulas to calculate the geodesic for the surface of the Earth, with lesser or greater error depending on the position on the Earth's surface. In this regard, the most simple model is to approximate the Earth as a sphere; then, the geodesic distance is also called the great-circle or orthodromic distance [15]. Another approach is to consider an ellipsoidal model of the Earth; then again, there are several models of ellipsoids with variable error depending on the position. The WGS-84 ellipsoid is accepted as the most globally accurate.
In many cases, trajectories are constrained to spatial locations of the size of an area small enough to assume it is flat; these cases neglect the curvature of the planet. This case the most common in the literature with examples like taxi routes analysis; see [6] for a review of different approaches.

Spatial Clustering of Trajectories
Once the distances among trajectories are computed and stored, different clustering algorithms may be applied. Each of them offers pros and cons, and the best solution is domain-specific.
This project focuses on geometric properties of the trajectories, defined by characteristics like spatial distribution, proximity or length. The predominant algorithms are those based on density such as DBSCAN [16] or K-means [17], where the proximity is the most important feature. One of the drawbacks of K-means is its tendency to form spherical clusters, which is inadequate for clustering several streamlines as in the present case.
For clusters with irregular or non-spherical shapes, one of the most efficient algorithms is the spectral clustering [18]; it uses the eigenvalues, also referred to as the spectrum.
Quickbundles [13], presented before, is a clustering algorithm developed to reduce the dimensionality of brain streamlines datasets provided by tractography. It stands out due to its linear complexity with the number of streams and its interpretability.

Preliminaries
Prior to presenting the approach to cluster trajectories, some preliminary definitions are introduced.
A trajectory is the path followed by a moving object. Formally, a trajectory s is defined as a chronological sequence of multi-dimensional locations, which is denoted by s = P1, P2, ..., P n , with n being the number of points. Each point P n is represented as Location j -tuple at time T j [9].
The concept of distance among trajectories refers to the criteria to measure the similarity of trajectories. In this paper, the similarity focuses on the geometric patterns of trajectories. Two trajectories are identical if they have the same length and are defined by the same positions independently of the time. This kind of similarity is well-suited for analysis of routes in cities and countries, such as commuting routes from homes to work places, hot-spot areas and holiday road trips.
The geodesic distance is the shortest distance between two points located on the surface of the Earth, and there are several models to compute this distance: • Harvesine, or half of the versine, which is the central angle (Θ) between two points in an sphere: where d is the distance between two points along a great circle of the sphere and r is the radius of the sphere. The Haversine (hav(Θ)) is computed directly from the latitude and longitude of the two points: where ϕ 2 , ϕ 1 are the latitude of point 1 and 2. λ 2 , λ 1 is the longitude of points 1 and 2.
The haversine function computes half a versine of the angle θ: Then the distance between the two points: • Great circle distance [15]: In geometry, the great circle is defined as the intersection between a plane and a sphere that passes through the center of the sphere. This formula was manually implemented, and it is defined as follows: where the d is the distance, R is the Earth radius, a constant with R = 6371 km, and the same notation as before for the latitude (ϕ) and longitude (λ) of each point.
• Euclidean: by neglecting the curvature of the Earth, the euclidean distance between the two points is defined: • Geodesic WS84 [15]: in this case, the model of the Earth is an ellipsoid; in particular, the WS84 is the most globally accurate model. The used implementation is that included in the GeoPy library.
All the metrics defined above are suitable to calculate distances between two points. To compute distance among trajectories, different approaches have been proposed [9,19,20]. These approaches distinguish between trajectories defined by the same number of points or dissimilar. In the case of trajectories defined by the same number of points and lineal time complexity, there are two distances that have been used in moving objects data: the euclidean and the Hausdorff [21]. In this paper, we explored the use of Minimum Direct-Flip (MDF) [13], see Equation (7).
where |s i − t i | denotes the Euclidean distances between two points, s is a trajectory and t is the mean of the Euclidean distance. Figure 1 provides an insight of the working of the metric.

Quickbundles
Quickbundles (QB) [13] is a clustering algorithm with linear time performance with respect to the number of trajectories.
QB computing time depends lineally to the number of samples because (i) it uses MDF with linear complexity and (ii) there is no reassignment of the trajectories clustering. This means that trajectories are evaluated only once when performing the clustering and opposed to K-means, where there is amalgamating phase. Clusters are defined by a triplet c = (I, h, n), where I is a list of indexes of the trajectories in the cluster, n is the number of trajectories in the cluster, and h is the trajectories sum. h is a Kx2 matrix that can be updated for new trajectories (see Equation (8)).
This way, the centroid of the cluster is defined by v: Among the strengths of this clustering algorithm, we have its above-mentioned lineal complexity, the easiness to interpret the clusters and the versatility provided by changing the number of points that define each trajectory and the threshold value to consider trajectories being similar. All of these become very useful in the exploration of a dataset of trajectories.
However, QB was developed to overcome the complexity of segmentation of the brain fibers. In order to apply it to cluster trajectories, the following are required: • All trajectories have be defined by the same number of points. • A distance threshold to specify the maximum value to which two trajectories are clustered together has to be defined.

DataSet Description
The experiments presented in this section were performed using the Kaggle dataset of taxi trajectorie https://www.kaggle.com/crailtap/taxi-trajectory, (accessed date 17 April 2021). both It contains trajectory information from 1 July 2013 to 30 June 2014 for 442 taxis running in the city of Porto, in Portugal. In total, 1,704,757 trajectories are available.
Each register contains 9 features, described as follows: • TRIP_ID: (string) It contains an unique identifier for each trip. For this paper, the TIMESTAMP and POLYLINE features were used.

CROSS-CPP Dataset Description
The original unprocessed data included 6094 different trajectories from 480 vehicles collected from May 2019 to the October of 2021. The only attribute of the data is the GPS streamline.
During the preprocessing step, it was found out that 742 trajectories were defined with less than 100 coordinates, and they were removed. In conclusion, 5852 trajectories were taken into account for the final experimentation.

Parameter Setting
Experiments were carried out to compare the behaviour of two clustering algorithms: Quickbundles and spectral clustering.
In order to validate results, ten different experiments with batches of 2000 trajectories were conducted for each algorithm and distance metric. A total of 80 experiments were run (10 per pair algorithm-distance metric). The mean Silhouette Coefficients of each set of the 10 experiments carried out was analyzed.
To build each experimentation batch, a different approach was set depending on the data source: • Kaggle data: since there were 1,704,757 trajectories, each experiment run for an algorithm-distance metric pair comprised 2000 different trajectories each time, guaranteeing no repetition of trajectories in each experiment.
• Cross-CPP data: data from only 5852 remained for experimentation after filtering, and each experiment was run over a subset of 2000 randomly selected trajectories, unavoidably repeating trajectories between experiments.
In each experiment, the hyperparameters of QuickBundles were adjusted via a 10-fold cross-validation approach. The adjusted hyperparameters were:

•
Resampling factor: number of coordinates to which each trajectory will be resampled (as the algorithm requires all trajectories to have the same number of coordinates). • Neighboring threshold: distance at which two trajectory points are considered neighbors.
The spectral clustering implementation used the radial basis function (RBF) kernel for its construction. It is to note that, for the Spectral Clustering algorithm, only the number of clusters hyperparameter was checked. Results are displayed for its maximum silhouette coefficient value.

Validation Metrics
The silhouette was the validation metric used, based on the definition of Rousseeuw [22]. i is a data point in the cluster C i . a is the mean distance between i and all other points in the cluster: The mean dissimilarity of point i to some cluster C k is the mean of the distance from i to all points in C k . b is the smallest distance of i to a cluster different from C i : Then, the silhouette value for one point i is given by: The term silhouette coefficient is defined as the maximum value of the mean s(i) over all data of the entire dataset: where t(k) represents the mean s(i) over all data for the number of clusters k.

Results
The following tables show the average of the silhouette coefficient for the clustering algorithms and the distance measures in each of the datasets (Tables 1 and 2). One can observe that Quickbundles clearly algorithm ourperforms Spectral clustering. Observe that the geeodesic distance clearly yields the best results even when points to calculate distances in the dataset are not much apart.

Integration in CROSS-CPP
As was mentioned in Section 1, the CROSS-CPP is an IT environment for the integration and analytics of data streams coming from high volume (mass) products with cyber physical features. Consequently, in this section, we will firstly review the process that has to be followed to harvest data and then make available for the data analytics toolbox to show how the proposed clustering algorithm has been integrated in Cross-CPP. Once we describe the integration on the CROSS-CPP framework, we will analyze how to run and the results of different executions using real datasets.
In what follows, the process performed to calculate clustering of trajectories of the data available at the platform is analyzed. The process is divided into two phases: Obtaining access to the data (data harvesting) and performing the desired analysis-clustering in this case.

Data Harvesting
The accessing data phase started by logging into the Agora website at the "Home" tab and choosing vehicles' origin data. Figure 2 shows the screen where basic statistics are displayed for the selected data: The process continued by moving to the "Data Discovery" tab and selecting the type of signal to be used. Filtering by vehicles, each signal contains the values for a vehicle features, i.e., vehicle speed, latitude, and temperature. This information is contained in a channel with an unique identifier. The "position", channel number two, signal was chosen, containing the latitude and longitude as can be seen in Figure 3. Once the channel was selected, no geographical or temporal filters were added in the example. Then, the platform provided useful insights in the form of a heatmap for geographical reference and a slide bar for the time distribution of the trajectories as shown in Figure 4. In addition, general statistics are also given as can be observed in Figure 5. The analytic phase starts upon the approval of the data request. In the "Toolbox" and trajectories tab, the above-mentioned data request can be selected. This provides the option of adding temporal or geographical filters again, although in our example, none of these options were used . On the other hand, in the analysis option, we selected "Clustering", as it can be observed in Figure 6, in order to execute the clustering.

Discussion and Conclusions
In this paper, we have presented results of the integration Quickbundles algorithm for the moving objects' trajectory clustering in the CROSS-CPP platform. Two main aspects must be discussed: (i) integration issues and (ii) clustering algorithm performance.
In relation to integration, the CROSS-CPP marketplace makes it possible for data providers to store data so that service providers can use these data to generate applications based on the knowledge extracted from these data. In order to extract the knowledge from the data, the analytic toolbox has been developed and integrated. One of the main advantages of the analytic toolbox is its flexibility to include new functionalities; in particular, in this paper, we have shown how clustering for moving objects has been integrated. More precisely, the Quickbundles algorithm has been adapted from the health domain to calculate the clustering of trajectories. In the paper we have shown the end-to-end process (see Section 5) from data harvesting to data analysis. In relation to performance, the Quickbundles algorithm has been proposed in this paper for trajectory clustering. We chose this instead of center-based clustering algorithms such as K-means due to the fact that spherical clusters are not appropriate for trajectory clustering. As a consequence, we conducted experiments to compare the behavior of spectral clustering and Quickbundles using different distance measures. The obtained results show that Quickbundles outperforms spectral clustering, with this behavior being coherent in both datasets: the Porto Kaggle and the CROSS-CPP data.
The silhouette metric was used to compare results. The silhouette coefficient metrics are known to be a ratio that depends on both the cohesion, that is to say the similarity inside each cluster, and on the separation, that is, the distance of points belonging to different clusters. From analyzing the obtained results, one could conclude that QB algorithm obtain clusters with a better ratio between cohesion and separation. Spectral clustering may yield clusters with better cohesion or separation metrics, but not both at the same time. However, it would be a future work to analyze these metrics in isolation.
In relation to experimentation, not all of the possible parameters for the compared algorithms have been explored. In particular, for spectral only, the number clusters has been modified in the different executions. For the case of QB, we ran experiments adjusting both the resampling factor and the neighboring threshold. In particular, for QB the experiments raised the fact that the resampling factor does not offer great changes in performance but the neighbouring threshold was shown to be determinant. This is in accordance with what has been already mentioned in the literature.
For experiments run with spectral clustering, we chose the number of clusters yielding the best results. Future work should include a deeper study by using different methods to construct the affinity matrix and fine-tuning the C and σ parameters, among others.
In relation with the distance measure, despite the fact that not high differences in silhouette coefficient were observed, results yielded slightly better results when using the geodesic WS84 in both datasets. Note that this distance improved performance even in the case of Porto, where the difference of distances is not very big.
Regarding time consumption, although experiments were not conducted to measure this issue, we could observe that the spectral algorithm took up to three times less the time to execute in comparison to Quickbundles. Although this should be further explored, results obtained are in accordance to literature: in [13], the authors establish that the time of execution of QB increases linearly with the number of trajectories .
Finally, the clustering was performed in batch mode. However, it is important to note that the CROSS-CPP analytic toolbox allows for data-streaming mode. In future works, it would be interesting to analyze the behavior of the algorithm for that setting.
In conclusion, in this paper, we have presented results of the integration of Quickbundles algorithm for the moving objects trajectories clustering in the CROSS-CPP platform. It has been successfully integrated and is available in the CROSS-CPP project. The reason to choose this algorithm was based on the results shown, in which so far the QB algorithms is shown to outperform the spectral clustering algorithm.
As was already mentioned, future work shall include experimenting with streaming clustering algorithms for real-time analysis. Furthermore, deeper studies to analyze the effect of the hyperparameters on the behavior of the algorithms as well as to measure the time consumption should be required to decide the best option depending on the amount and nature of the dataset.

Conflicts of Interest:
The authors declare no conflict of interest.