A Distributed Execution Pipeline for Clustering Trajectories Based on a Fuzzy Similarity Relation

The proliferation of indoor and outdoor tracking devices has led to a vast amount of spatial data. Each object can be described by several trajectories that, once analysed, can yield significant knowledge. In particular, pattern analysis by clustering generic trajectories can give insight into objects sharing the same patterns. Still, sequential clustering approaches fail to handle large volumes of data; hence the need for distributed systems able to infer knowledge within a reasonable time. In this paper, we detail an efficient, scalable and distributed execution pipeline for clustering raw trajectories. The clustering is achieved via a fuzzy similarity relation obtained by the transitive closure of a proximity relation. The pipeline is integrated in Spark, implemented in Scala, and leverages the Core and GraphX libraries, making use of Resilient Distributed Datasets (RDDs) and graph processing. Furthermore, a new simple but very efficient partitioning logic has been deployed in Spark and integrated into the execution process. The objective behind this logic is to distribute the load equally among all executors by considering the complexity of the data. In particular, resolving the load balancing issue significantly reduced the execution time compared with the conventional pipeline. The performance of the whole distributed process is evaluated on the Geolife project's GPS trajectory dataset.


Introduction
Nowadays, billions of events are generated by location-aware devices in a matter of seconds. The knowledge extracted from these events can be leveraged to increase economic gain, reinforce security and support decision making. In particular, the pattern analysis of moving object trajectories makes it possible to understand or predict behaviour. Hence, cluster analysis has become indispensable. Still, the real-time, fuzzy and heavy nature of events has been a burden for conventional clustering approaches. Consequently, researchers have taken interest in fields like big data, online clustering and fuzzy logic.
Big data technologies often refer to notions like concurrency, distribution, parallelism and stream processing. They are mainly deployed in a cluster environment where the memory and storage can be shared by the cluster's nodes. The nodes can be located in the same place or distributed among regions, with a possible peer-to-peer or master-slave connection strategy. High availability, mass storage, mass computing and fault tolerance are the primary features of big data ecosystems. In this study, we take interest in leveraging the functionalities of Spark (https://spark.apache.org/) for clustering trajectories. The choice is supported by Spark's fast computation, explained by its use of Random Access Memory (RAM). In particular, we analyse the Geolife project's GPS trajectory dataset [1][2][3]. The dataset is stored in HDFS, and an entire pipeline for reading, manipulating and defining clusters is detailed in this paper. Clustering is done by computing the max-min transitive closure of a fuzzy proximity relation. Several trajectory similarity indices exist in the literature [4], and reasoning about which one is more accurate than the others is out of the scope of this paper. For simplicity, we consider the length of the Longest Common Subsequence (LCSS) [5]. In the process, we have included the max-min, max-delta and max-product transitivity. Both the smart [6] and semi-naive [7] algorithms are available as options for computing the transitive closure. The system is implemented in Scala; hence, futures, traits, objects and higher-order functions are integrated to enrich the system's model. The pipeline's activity diagrams, the system's class diagrams and the different algorithms are detailed. Furthermore, we encountered an issue related to unbalanced workloads; thus, we propose a new partitioning logic that considers the complexity of the data and partitions it equally across the executors. By resolving the load balancing issue, the computation time was significantly reduced.
After defining the clusters, we persist the results in a MongoDB (https://www.mongodb.com/) collection by leveraging the Reactive MongoDB driver (http://reactivemongo.org/), known for its asynchronous nature. This final step can be extended to infer knowledge and analysed to understand Spark's behaviour at asynchronous boundaries. To the best of our knowledge, this is the first research giving a detailed implementation of the raw trajectory clustering process in Spark by leveraging a fuzzy similarity relation, analysing the performance of repartitioning tasks, and providing a new partitioning strategy for handling trajectory data skewness together with empirical evidence of its efficiency.
The rest of this paper is organised as follows: Section 2 discusses the existing literature; Section 3 provides background on fuzzy logic, fuzzy relations, the LCSS problem and the transitive closure; Section 4 presents the different algorithms and the class and activity diagrams; Section 5 evaluates the different partitioning stages; Section 6 concludes the paper and highlights future work.

Literature
Since Zadeh initiated fuzzy logic [8], researchers have taken interest in defining fuzzy clusters, which better reflect the fuzzy nature of features than crisp ones. Consequently, fuzzy clustering approaches have emerged. Research on fuzzy clustering can be categorised into approaches leveraging an objective function (fuzzy c-means) and approaches manipulating a proximity relation. The latter can be further classified into approaches that handle the proximity relation directly and approaches that convert the proximity relation into a similarity relation and then apply a specific clustering technique to get the different partition trees [9].
Research on direct approaches includes [10]. The authors proposed a new approach based on the entropy measure, which lies in [0, 1]. The main idea is that this measure between two nodes equals 0 if they are either very close to or very far from each other, and 1 if their distance gets close to the average distance between all nodes. The entropies are measured from a proximity relation, and the cluster centres are defined by the lowest entropies. Boulmakoul et al. [11] defined a quadratic time complexity algorithm for defining clusters from a proximity relation. The authors define a cluster as a maximum weighted clique of nodes, where the weights refer to the different similarities. The obtained clusters are both compact and well separated. Both of the above works define a cluster node as a node similar to all the other nodes in the same cluster. In contrast, Kondruk [12] defined an approach based on cluster centres: each cluster node is similar to the centre, and the centres are updated at each iteration.
Different algorithms exist in the literature to identify the transitive closure. The best known are the smart [6] and semi-naive [7] algorithms. They are leveraged to extract a similarity relation leading to extended clusters. Tamura et al. [13] introduced the max-min transitivity, which yields a similarity relation having equivalence relations as a resolution form. Each equivalence relation leads to non-overlapping partition trees. Yang et al. [14] defined max-prod and max-∆ as an extension of the previous work. However, the resulting relation can lead to overlapping partitions; hence the need for specific algorithms to extract non-overlapping clusters based on a specific cluster definition. The direct approaches can also handle these max-t transitive relations to extract non-overlapping clusters. Liang et al. [15] identified clusters by considering a metric of trapezoidal numbers and leveraging the max-min transitivity.
In the scope of the latter approaches, relatively few works have evaluated the performance of computing the transitive closure in a distributed environment. In particular, studies [16,17] have proposed new fragmentation techniques for computing the transitive closure in parallel. Gribkoff [18] evaluated the smart and semi-naive algorithms in the Hadoop map/reduce environment. Although Spark outperforms Hadoop in processing, none of the previous works have evaluated the transitive closure in its environment. An integration in Spark was only given in [19,20]. In the former, we extended the approach of [11] by leveraging Spark for clustering spatial trajectories. Still, our proposition could not fully distribute the clustering process. Moreover, we did not consider the transitive closure, and we did not give a detailed implementation of the distributed proximity relation construction process. In the latter, we applied the max-min transitivity over a proximity relation referring to the similarities between cyber-criminals on Twitter. However, we neither evaluated the performance nor considered spatial data. The literature does include several works related to trajectory clustering [21] and its applications to real-world problems [22]. Still, to the best of our knowledge, this is the first work discussing the use of the transitive closure on a fuzzy similarity relation to extract clusters of raw trajectories with Spark.

Background
Definition 1. Let A be a lexical attribute (e.g., warm). The fuzzy set characterising this attribute is denoted $A = \{(x, \mu_A(x)) \mid x \in X\}$. The function $\mu_A : X \to [0, 1]$ defines the degree of membership of a lexical variable to the set A. Thus, each element of the universe of discourse $X = \{x_1, x_2, \ldots, x_n\}$ can be attributed to several sets with varying membership degrees [23].
Definition 2. Elements of two crisp sets X and Y can be related to each other via a relation. The relation is binary and can be formulated as $R = \{((x, y), \mu_R(x, y)) \mid x \in X, y \in Y\}$. If the relation is crisp, the strength of the relationship $\mu_R(x, y)$ equals 0 or 1. In contrast, a binary fuzzy relation has its membership degrees in [0, 1]. In our study, we consider the latter relation.
R has the resolution form $R = \bigcup_{\alpha \in [0, 1]} \alpha R_\alpha$, where $R_\alpha = \{(x, y) \mid \mu_R(x, y) \geq \alpha\}$ is the crisp α-cut of R. R can be called a proximity relation if it is both reflexive and symmetric, that is, $\mu_R(x, x) = 1$ and $\mu_R(x, y) = \mu_R(y, x)$ for all x, y ∈ X. Additionally, if R respects transitivity, it can be referred to as a fuzzy similarity relation. Transitivity is defined as $\mu_R(x, z) \geq \max_{y} T(\mu_R(x, y), \mu_R(y, z))$, where T is a t-norm defining the nature of the transitivity. Several norms of this kind exist, including $T_{\min}(a, b) = \min(a, b)$, $T_{prod}(a, b) = a \cdot b$ and $T_{\Delta}(a, b) = \max(0, a + b - 1)$, which respectively form the sets of max-min, max-prod and max-∆ transitive relations [24]. In particular, the max-min transitivity yields a fuzzy similarity relation that characterizes each crisp relation in the resolution form as an equivalence relation. Different partition trees can be obtained for each α-level from these equivalence relations [14]. The similarity relation itself is obtained from the proximity relation through repeated compositions of the form $(R \circ R)(x, z) = \max_{y} T(\mu_R(x, y), \mu_R(y, z))$ (Definition 3). In our paper, we deploy both the semi-naive and smart algorithms for computing the transitive closure (see Section 4.1 for a precise implementation).
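To make these definitions concrete, the following worked example applies the max-min composition to a three-element proximity relation. The membership values are illustrative only and are not taken from the dataset.

```latex
R = \begin{pmatrix} 1 & 0.8 & 0.4 \\ 0.8 & 1 & 0.5 \\ 0.4 & 0.5 & 1 \end{pmatrix},
\qquad
(R \circ R)(x_1, x_3) = \max\{\min(1, 0.4),\ \min(0.8, 0.5),\ \min(0.4, 1)\} = 0.5

R \circ R = \begin{pmatrix} 1 & 0.8 & 0.5 \\ 0.8 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix},
\qquad
(R \circ R) \circ (R \circ R) = R \circ R
```

The original R is not max-min transitive, since $\mu_R(x_1, x_3) = 0.4 < \min(0.8, 0.5)$; one composition repairs this and a second one changes nothing, so $R \circ R$ is the transitive closure. Cutting it at α = 0.8 gives the classes {x1, x2} and {x3}, while α = 0.5 merges all three elements into one class.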

Definition 4. A trajectory can be considered mainly as a temporally ordered sequence that is not necessarily related to space; in this case it is called metaphorical. Metaphorical trajectories describe the variations of an attribute over time. For an attribute a, $Tr_a = \{(a_0, t_0), \ldots, (a_n, t_n)\}$ with $t_i$ a timestamp. Another representation considers the variations of an attribute over abstract spatial regions instead of spatial coordinates, e.g., cities and countries. This representation is referred to as naïve, and $Tr_a = \{(a_0, t_0, C_0), \ldots, (a_n, t_n, C_n)\}$ with $C_i$ the country or region code. In contrast, raw trajectories consider spatial coordinates and describe the spatio-temporal profile of a moving object. Each trajectory has the form $Tr_{(object_{id}, trajectory_{id})} = \{(long_0, lat_0, t_0), \ldots, (long_n, lat_n, t_n)\}$ with $lat_i$ and $long_i$ respectively denoting the latitude and longitude in degrees [25]. In our study, we handle the raw trajectories of the Geolife project's GPS trajectory dataset [1][2][3].
Definition 5. Let $Tr_1$ and $Tr_2$ be two trajectories. $Tr_2$ is a subsequence of $Tr_1$ if all the points in $Tr_2$ match points in $Tr_1$ in the same order, with gaps allowed. A match can be subject to constraints on the points' properties. The longest common subsequence (LCSS) problem can be leveraged to identify similar trajectories. The LCSS between two trajectories is defined recursively: different alignments between the two sequences are tested to get the length of the LCSS. The authors of [5] identified two points as similar if they respect both a time and a distance threshold. Furthermore, they defined the similarity as $S(Tr_1, Tr_2) = \frac{LCSS(Tr_1, Tr_2)}{\max(Tr_1.size, Tr_2.size)}$. Aiming to get high similarities, we weaken this constraint into considering either the time or the distance. Note that the distance constraint is in meters while the points are defined by their degrees. Hence, we derive from this distance constraint two thresholds [26], one for the latitude and one for the longitude. The points' coordinates are converted into radians, and their latitude and longitude differences are compared with these thresholds.
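As a reference point, the LCSS length is commonly written in the following recursive form. Here $Head(Tr)$ denotes Tr without its last point, $p_n$ and $q_m$ are the last points of $Tr_1$ and $Tr_2$, and $match$ is a placeholder for the point-similarity test (in our setting, the time or distance constraint). This notation is an assumption on our part and may differ from the one used in [5].

```latex
LCSS(Tr_1, Tr_2) =
\begin{cases}
0 & \text{if } Tr_1 \text{ or } Tr_2 \text{ is empty} \\
1 + LCSS(Head(Tr_1), Head(Tr_2)) & \text{if } match(p_n, q_m) \\
\max\{LCSS(Head(Tr_1), Tr_2),\ LCSS(Tr_1, Head(Tr_2))\} & \text{otherwise}
\end{cases}
```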
The naive recursive solution of the LCSS problem takes exponential time. An alternative based on memoization has been proposed to solve the problem in quadratic time and space. Other heuristics have been proposed to give an approximate solution in less time. Still, we stick to the exact approach of [27], which solves the problem in quadratic time and linear space. The algorithm is described in Algorithm 1. Note that the algorithms in this paper are written in a pseudocode close to the Scala language, because a formal notation would not be sufficient to express all the functional features.
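The quadratic-time, linear-space idea can be sketched with two rolling rows of the dynamic programming table. This is a minimal illustration and not a reproduction of Algorithm 1: the `Point` type, the `similar` rule and its threshold parameters are hypothetical names introduced here.

```scala
object Lcss {
  // Hypothetical point type: coordinates in radians, timestamp in seconds.
  final case class Point(lon: Double, lat: Double, t: Long)

  // Illustrative matching rule (an assumption): two points match if EITHER the
  // time gap OR both coordinate gaps stay within their thresholds.
  def similar(p: Point, q: Point, maxDt: Long, maxDLon: Double, maxDLat: Double): Boolean =
    math.abs(p.t - q.t) <= maxDt ||
      (math.abs(p.lon - q.lon) <= maxDLon && math.abs(p.lat - q.lat) <= maxDLat)

  // LCSS length in O(|a| * |b|) time and O(|b|) space: only the previous and
  // the current row of the table are kept.
  def length[A](a: IndexedSeq[A], b: IndexedSeq[A])(matches: (A, A) => Boolean): Int = {
    var prev = new Array[Int](b.length + 1)
    var curr = new Array[Int](b.length + 1)
    for (i <- 1 to a.length) {
      for (j <- 1 to b.length) {
        curr(j) =
          if (matches(a(i - 1), b(j - 1))) prev(j - 1) + 1
          else math.max(prev(j), curr(j - 1))
      }
      // Swap the rows; index 0 of both rows always stays 0.
      val swap = prev; prev = curr; curr = swap
    }
    prev(b.length)
  }

  // Similarity as the LCSS length divided by the size of the longer sequence.
  def similarity[A](a: IndexedSeq[A], b: IndexedSeq[A])(matches: (A, A) => Boolean): Double =
    if (a.isEmpty || b.isEmpty) 0.0
    else length(a, b)(matches).toDouble / math.max(a.length, b.length)
}
```

For instance, `Lcss.length("ABCBDAB".toVector, "BDCABA".toVector)(_ == _)` evaluates to 4. An iterative formulation of this kind also avoids deep call stacks on long trajectories.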

Materials and Methods
The project is implemented in Scala and aims to extract clusters from massive trajectory logs stored in HDFS and persist them in MongoDB. Spark is leveraged to conduct the work in a distributed manner. RDDs distributed across memory and disk are created, and the stages are executed in parallel by distributing tasks across worker nodes. In particular, the system must make it possible to employ either the smart or the semi-naive algorithm for computing the transitive closure, and to apply any of the max-min, max-delta or max-product t-norms for transitivity. Moreover, the max-delta and max-product t-norms need a specific kind of handler.
Theorem 2. Under the same supposition as Theorem 1, the proposition x + y − 1 ≥ α is not necessarily true; in the case below it holds only if α = 1.
Proof of Theorem 2. Let us suppose that x = y = α. Then x + y − 1 ≥ α is equivalent to 2α − 1 ≥ α. The last statement is equivalent to α ≥ 1, which holds only if α = 1 because, by our supposition, α ≤ 1. Consequently, for α ∈ [0, 1[, x + y − 1 ≥ α is not always true.
These two theorems prove that, at each step of computing the t-norms, we must filter out the norms below the chosen α-level. To provide all these possibilities, we present Figures 1 and 2, explained in the next subsection.
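The need for this filter can be illustrated with a small sketch. The object name `TNorms` and the signature of `handle` are assumptions made for this example; only the three t-norms themselves come from the text.

```scala
object TNorms {
  type Norm = (Double, Double) => Double

  val minNorm: Norm = (x, y) => math.min(x, y)
  val productNorm: Norm = (x, y) => x * y
  val deltaNorm: Norm = (x, y) => math.max(0.0, x + y - 1.0)

  // Keep a composed weight only if it still reaches the alpha-level.
  // With inputs >= alpha the minimum always passes, whereas the product
  // and the delta norm may fall below alpha and get dropped.
  def handle(norm: Norm, alpha: Double)(x: Double, y: Double): Option[Double] =
    Some(norm(x, y)).filter(_ >= alpha)
}
```

With α = 0.6 and inputs 0.6 and 0.7, which both reach the α-level, `handle(minNorm, 0.6)(0.6, 0.7)` keeps the value 0.6, while the product (0.42) and the delta norm (about 0.3) are both filtered out.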

The Project's Class Diagrams
The project integrates the Scala stackable trait pattern. In detail, we provide an abstraction called AbstractMatrix, an abstract class holding the abstract definitions of all the operations needed to compute the transitive closure. Different traits then override specific abstract definitions to customize the behavior. Scala traits extend the idea of Java interfaces with much richer functionality. In our case, traits extend AbstractMatrix with specific behavior; they are called mixins. For example, MaxMinNorm overrides the norm function to compute the minimum of its inputs, and the Seminaive mixin overrides the definition of the closure algorithm. These traits can only be mixed into classes that already extend AbstractMatrix. AbstractMatrix also has a companion object defining a specific KeyEntry type and a static ireverse operation, which reverses the Similarity class entries for the closure's composition join.
The Similarity class resembles the MatrixEntry case class in Spark's MLlib library. However, it is not a case class, and it overrides both the equals and hashCode operations to provide specific comparison behavior. The rationale behind this class is an anomaly we observed when computing the transitive closure: the computing time was very large. After inspecting the process, we found that, when subtracting the results of the past and present iterations, the closure's similarities were not considered equal because of their weights, although the weights differed only at a precision of $10^{-9}$. This led us to provide our own comparison strategy, with a custom precision of $10^{-3}$, integrated in the Similarity class.
The abstraction is extended by another trait, DefaultMatrix. As the name suggests, this trait provides default behavior for the abstract definitions, which makes it possible to leverage the abstract operations without defining a default behavior in each class. As Figure 2 shows, the class SiMatrix can leverage the AbstractMatrix operations without having to override them. This is our extension of the stackable trait pattern. Furthermore, Scala traits support self types, meaning that a trait can only be mixed into a subclass of the specified type. This is why we implemented MatrixImpl, a self type of DefaultMatrix, adding a restriction to the inheritance strategy. The concrete class SiMatrix extends both the DefaultMatrix and Serializable traits, respectively to be able to leverage all the AbstractMatrix traits and to execute Spark stages. Note that in the other traits we did not override the composition operation, because we provide a concrete definition in the lowest-level SiMatrix class.
To create the SiMatrix objects, we needed a factory pattern. Fortunately, the factory pattern is simplified in Scala by companion objects (a companion object of a class has the same name as the class and can access its private variables and operations). The SiMatrix companion object provides the factory method for instantiating its class. An extract of the apply method is described in Algorithm 2.
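The pattern described above can be sketched without Spark as follows. The class and trait names AbstractMatrix, DefaultMatrix and SiMatrix come from the text, but their members, the AlphaFilter, MaxProductNorm and MaxDeltaNorm mixins and the simplified two-argument factory are assumptions made for illustration; the actual apply method also receives the distributed similarities and the closure algorithm.

```scala
object StackableSketch {
  // Base abstraction: declares the operations that mixins may override.
  abstract class AbstractMatrix {
    def alpha: Double
    def norm(x: Double, y: Double): Double
    def handle(x: Double, y: Double): Option[Double]
  }

  // Default behaviour, so that concrete classes need not override anything.
  trait DefaultMatrix extends AbstractMatrix {
    override def norm(x: Double, y: Double): Double = math.min(x, y)
    override def handle(x: Double, y: Double): Option[Double] = Some(norm(x, y))
  }

  // Mixins: usable only on classes that already extend AbstractMatrix.
  trait MaxProductNorm extends AbstractMatrix {
    override def norm(x: Double, y: Double): Double = x * y
  }

  trait MaxDeltaNorm extends AbstractMatrix {
    override def norm(x: Double, y: Double): Double = math.max(0.0, x + y - 1.0)
  }

  // Stackable modification: wraps whatever handle comes before it in the
  // linearization and drops results below the alpha-level.
  trait AlphaFilter extends AbstractMatrix {
    abstract override def handle(x: Double, y: Double): Option[Double] =
      super.handle(x, y).filter(_ >= alpha)
  }

  class SiMatrix(val alpha: Double) extends DefaultMatrix

  // Companion object acting as a curried factory that selects the mixins.
  object SiMatrix {
    def apply(alpha: Double)(tnorm: String): SiMatrix = tnorm match {
      case "max-product" => new SiMatrix(alpha) with MaxProductNorm with AlphaFilter
      case "max-delta"   => new SiMatrix(alpha) with MaxDeltaNorm with AlphaFilter
      case _             => new SiMatrix(alpha)
    }
  }
}
```

In `new SiMatrix(alpha) with MaxProductNorm with AlphaFilter`, the right-most trait is resolved first: AlphaFilter.handle delegates through `super` to DefaultMatrix.handle, which in turn calls the norm overridden by MaxProductNorm.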
The apply function is a higher-order function based on currying. The first function takes the distributed similarities as input and outputs a function that takes the alpha threshold used for the partitioning, the chosen t-norm and the transitive closure algorithm. Then, depending on the t-norm and closure algorithm, it instantiates the SiMatrix class while specifying the traits to be mixed in. Note that the order of the traits is important. Scala relies on the linearization principle: the last trait is the first to be extended. Hence the name stackable trait pattern; it is like a stack of traits where the first one to be extended is actually the last. The apply method of the SiMatrix object leverages the flatten function to create a reversed duplicate of each entry and return an iterator over the couple, which is then flattened by Spark's flatMap operation.
For computing the closure, we specified two traits, Seminaive and Smart, related respectively to the semi-naive and smart algorithms. The difference between the two is that, at each iteration, the first considers the initial relation when computing the composition, while the second considers the relation of the current iteration. This led us to define a curried function with default parameters for each trait. In the Seminaive trait (Algorithm 3), we specify that Rx equals the predefined Rx in AbstractMatrix, which gets overridden in the SiMatrix class. The last function in the closure's curried form computes the composition of the previously specified parameters Ri and Rx. This reflects the potential of Scala's higher-order functions for simplifying highly complex computations. At the heart of the closure function, we check whether the current Rj equals the past iteration's Ri. If the condition holds, we return Ri; otherwise, we call the closure again recursively with Rj as the new Ri and the other parameters left at their defaults. As for the Smart trait (Algorithm 4), we change Rx into a KeyEntry value returned by calling the ireverse function on the first parameter Ri.
Note that to compute the transitive closure, we consider the type KeyEntry, which reflects the RDD[(Long, (Long, Double))] type. The first relation gets its columns extracted and joined with the rows of the second relation, which are extracted by the ireverse function. The composition is overridden in the SiMatrix class (see Algorithm 5). In particular, the composition calls the handle function, which in turn calls the norm function. Hence, the order of the mixins is very important. The order of the comparison is also important to reduce additional time overhead. Finally, the closure can be accessed via the immutable lazy value closure_edges. The laziness reflects the lazy initialization pattern, where the value gets instantiated on first access. The overall pipeline of the process is described in the next subsection with activity diagrams.
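The difference between the two closure strategies can be sketched on local collections. This is an illustration under stated assumptions and not Algorithms 3 to 5: relations are plain immutable maps instead of RDDs, the t-norm is fixed to the minimum, and the `merge` step (a union keeping the stronger weight) is added here so that the sketch also behaves well on relations without a reflexive diagonal.

```scala
import scala.annotation.tailrec

object ClosureSketch {
  type Relation = Map[(Long, Long), Double]
  private val emptyRelation: Relation = Map.empty

  // Max-min composition: join the columns of `left` with the rows of `right`.
  def compose(left: Relation, right: Relation): Relation = {
    val rowsOfRight = right.groupBy { case ((row, _), _) => row }
    val candidates = for {
      ((x, y), w1) <- left.toSeq
      ((_, z), w2) <- rowsOfRight.getOrElse(y, emptyRelation).toSeq
    } yield ((x, z), math.min(w1, w2))
    candidates.groupBy(_._1).map { case (key, ws) => key -> ws.map(_._2).max }
  }

  // Union of two relations, keeping the stronger weight for each pair.
  def merge(a: Relation, b: Relation): Relation =
    (a.toSeq ++ b.toSeq).groupBy(_._1).map { case (key, ws) => key -> ws.map(_._2).max }

  // Semi-naive style: always compose the current relation with the initial one.
  @tailrec
  def seminaive(ri: Relation, rx: Relation): Relation = {
    val rj = merge(ri, compose(ri, rx))
    if (rj == ri) ri else seminaive(rj, rx)
  }

  // Smart style: compose the current relation with itself (repeated squaring).
  @tailrec
  def smart(ri: Relation): Relation = {
    val rj = merge(ri, compose(ri, ri))
    if (rj == ri) ri else smart(rj)
  }
}
```

Both functions stop when an iteration no longer changes the relation. The exact equality test is safe here because minimum and maximum never create new floating-point values; with the product or delta norms a tolerance-based comparison, as in the Similarity class, becomes relevant.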

The Overall Execution Pipeline
The activity diagram in Figure 3 reflects the overall process. All the actions in the diagram are distributed. First, we leverage Spark's wholeTextFiles operation to read the different directories, and we specify the number of partitions. While experimenting, we observed that if we specify n partitions we get 2*n of them. After reading the files, we use Spark's map operation to retrieve the trajectory from each file. Spark's map function is a higher-order function taking as a parameter another function, which we supply. Figure 4 illustrates the trajectories' creation process. Each file is handled in a distributed manner; however, the lines of each file are handled sequentially to create instances of the PointEntry class. When we then evaluated the system, we observed a relatively large time overhead. This led us to specify our own custom partitioning strategy based on the trajectories' sizes. The strategy is implemented in the LoadPartitioner class, which extends Spark's Partitioner abstract class. The partitioner has to override the functions numPartitions and getPartition; the latter returns the id of the chosen partition. Details are given in Algorithm 6. The main idea behind the partitioning strategy is to return the partition with the lowest size, where the size of a partition is defined not by its number of trajectories but by the number of points in those trajectories. The efficiency of this logic is evaluated in Section 5.
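The idea of always choosing the least-loaded partition can be sketched as below. This is not Algorithm 6: the sketch is Spark-free, the class only mirrors the two members of Spark's Partitioner (numPartitions and getPartition), and both the largest-first ordering and the precomputed lookup table keyed by a trajectory id are assumptions made so that getPartition stays a deterministic function of the key.

```scala
object LoadBalancing {
  // Greedy assignment: visit trajectories from largest to smallest and place
  // each one on the partition currently holding the fewest points.
  def assign(sizes: Map[String, Int], numPartitions: Int): Map[String, Int] = {
    val load = new Array[Long](numPartitions) // points per partition so far
    sizes.toSeq.sortBy { case (id, n) => (-n, id) }.map { case (id, n) =>
      val target = load.indices.minBy(i => load(i))
      load(target) += n
      id -> target
    }.toMap
  }

  // Same two members as Spark's Partitioner, kept independent of Spark here.
  final class LoadPartitioner(val numPartitions: Int, assignment: Map[String, Int]) {
    def getPartition(key: Any): Int = key match {
      case id: String => assignment.getOrElse(id, 0)
      case _          => 0
    }
  }
}
```

For example, four trajectories with 100, 60, 50 and 10 points spread over two partitions end up as 110 points on each side, whereas splitting them by count alone, two per partition in the order given, would yield 160 and 60.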

Results and Discussions
In the experiments, we deployed a cluster of 6 workers. The master and one worker have 8 cores each, with 8 GB and 6 GB of RAM respectively, while the others have 4 cores and 4 GB of memory. In the configuration, we set spark.driver.memory and spark.executor.memory to 4000 MB and 2000 MB respectively. At first, we observed a large number of StackOverflowError exceptions during the experiments. This is explained by the recursive calls of the lcss function in the executors. To resolve this issue, we increased the maximum stack size in the executors by setting spark.executor.extraJavaOptions to -Xss40M in the executors' configuration. This raises the maximum stack size from its default to 40 MB.
While evaluating the performance of the conventional pipeline, we observed a large time overhead. When we checked the metrics in the Spark UI, we observed a large imbalance between the tasks, visible in Figure 6. This is explained by the uneven complexity of the trajectories. To resolve this issue, we proposed our own partitioning strategy, already explained in Section 4. After integrating the LoadPartitioner strategy, we observed the balanced workload illustrated in Figure 7. Moreover, we evaluated the performance of the system and plotted the results in Figure 8. The results provide empirical evidence of the efficiency of our partitioning strategy in reducing the computation time by balancing the workloads equally.
After filtering the similarities, we observed an additional time overhead. The Spark UI showed the task distribution in Figure 9: a large number of tasks, each executed in a very short time, a matter of milliseconds. Hence, each partition contains a very limited number of elements compared to the load before the filter. To remove this additional overhead, we conducted a repartitioning just after the filter and observed the difference in Figure 10. The execution time of the two possibilities is plotted in Figure 11. We acknowledge that both the partitionBy and the repartitioning stages may cause an additional overhead because of the shuffle. However, over unbalanced loads and after a large filter operation, they are indispensable to reduce the execution time. In particular, as can be seen in Figure 10, the repartitioning after the filter reduces the number of partitions and reassembles them into a single executor, which greatly reduces the network I/O overhead.

Conclusions
We provide a pipeline for clustering trajectories by solving the LCSS problem. The clustering is achieved via the max-min transitive closure of the fuzzy proximity relation, and clusters are obtained for a predefined α-level. Spark is exploited to distribute the execution of the pipeline's tasks, and details about the implementation of the different stages are provided. To reduce the execution time, we studied the effect of repartitioning over the conventional pipeline. We then proposed a new partitioning logic, which resolves the load balancing issue, and integrated a repartitioning after the filtering of the similarities. Our results show the efficiency of our partitioning strategy and of the repartitioning in reducing the computation time. Moreover, we integrated the asynchronous Reactive MongoDB driver as a first step towards studying the behavior of Spark at asynchronous boundaries. We acknowledge that this integration needs further evaluation. In addition, we are currently working on two projects. The first aims to integrate a fully distributed algorithm for defining the clusters in the max-product and max-delta fuzzy similarity relations. The second is to employ a new iterative algorithm that exploits Spark's GraphX integration of Pregel. The latter should greatly reduce the time overhead of computing the transitive closure of a fuzzy subjective similarity relation.

Definition 3. The fuzzy similarity relation R can be extracted by achieving a series of compositions. A composition has the form $(R \circ R)(x, z) = \max_{y} T(\mu_R(x, y), \mu_R(y, z))$.

Figure 1. The project's main abstract class.

Figure 2. The implementation of the AbstractMatrix class.

Figure 3. The execution pipeline main activity diagram.

Figure 4. The creation of the trajectories.

Figure 5. The join and persistence process.

Figure 6. The unbalanced tasks workload.

Figure 7. The tasks load after integrating the LoadPartitioner strategy.

Figure 8. Execution time before and after integrating the LoadPartitioner strategy.

Figure 9. The workload without the repartitioning after the similarities filter.

Figure 10. The workload after the repartitioning.

Figure 11. The execution time without and with the repartitioning stage.