1. Introduction
Recent technological developments have led to a data deluge [1], a scenario where more data are generated than can be successfully and efficiently managed or captured. This results in missed opportunities to analyze and interpret data to make informed decisions.
When decision making calls for pattern discovery, the complexity grows further if the data have spatio-temporal features, because traditional algorithms are not designed to search for correlations that have a time dimension. This is, for example, the case of global positioning systems [2] and geographic information systems [3], which can be represented as spatio-temporal databases (STDBs)—that is, extensions to existing information systems that include time to better describe a dynamic environment [4].
The exploitation of STDBs can provide valuable knowledge, for instance, in the context of road traffic control and monitoring [5], weather analysis [6], and location-based sociological behavior in social networks [7]. However, as stated above, traditional data mining techniques cannot be directly applied to STDBs, which complicates data exploitation and increases processing times.
We are interested in the discovery of periodic patterns, which can be seen as events occurring with a certain “periodicity”—for example, the subway’s arrival at Central Park Station every 15 min defines a periodic pattern. A period corresponds to any unit of time, such as hours, days, weeks, et cetera. To be precise, a period is the time elapsed between two occurrences of a pattern, and it can be counted in terms of time or a number of transactions.
Sequential pattern mining is also concerned with finding statistically relevant patterns where data appear in a sequence [8]. The sequence is analyzed in such a manner that the possible patterns satisfy a minimum threshold while considering the length of the periods to be analyzed. From the point of view of performance, the discovery of valuable knowledge depends on two aspects: the volume of data and the processing power. Hence, in a context where data grow exponentially, it is critical to ensure the use of efficient algorithms, regardless of the available processing power.
Problem Definition
Let o be a spatio-temporal object defined by a point in time t and a spatial location l. A change in the shape of the object or in the object's location is known as an event. We will denote an event as e = (o, l, t), where o is the object at a location l and a time t. For simplicity, the space where the objects are located is segmented into a set of disjoint cells with equal sizes. A cell is denoted as c, and a sequence of localized events for the object o is denoted as S. Events belonging to S take place over a time series t_1, t_2, …, t_n, such that t_i < t_{i+1}, where 1 ≤ i < n.
Definition 1. Given a minimum support min_sup provided by a user, X is a p-periodic pattern if and only if X occurs in S with a frequency of at least min_sup, such that the length of X is p and p corresponds to the period. X is a p-periodic pattern over S if it satisfies the two user requirements: p and min_sup. To illustrate this, consider a sequence with min_sup = 1/3 and p = 3 that contains three subsequences, each containing three events. From such a sequence, it is feasible to obtain a p-periodic pattern that corresponds to a partial periodic pattern, because the symbol * can represent any event. Such a pattern is also a perfect periodic pattern when it appears across all three subsequences.
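To make the definition concrete, the following sketch (our own illustration; the event symbols and function names are assumptions, not part of the original formulation) checks whether a candidate pattern of length p reaches a given minimum support over the period-aligned subsequences of a sequence, with * standing for any event.

```python
def matches(pattern, segment):
    """True when every non-* symbol of the pattern equals the segment symbol."""
    return all(p == "*" or p == s for p, s in zip(pattern, segment))

def is_p_periodic(sequence, pattern, p, min_sup):
    """Check whether `pattern` (length p) reaches `min_sup` over the
    period-aligned subsequences of length p contained in `sequence`."""
    periods = [sequence[i:i + p] for i in range(0, len(sequence) - p + 1, p)]
    hits = sum(1 for seg in periods if matches(pattern, seg))
    return bool(periods) and hits / len(periods) >= min_sup

# 'ab*' occurs in every one of the three subsequences of period 3.
print(is_p_periodic("abcabdabe", "ab*", p=3, min_sup=1/3))  # True
```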
The main contributions of this paper are as follows:
Extensive experimentation: To the best of our knowledge, no previous work has compared, empirically, the performance of the most cited algorithms based on association rules, such as Apriori [9], MS-Apriori [10], FP-Growth [11], PPA [12], and Max-Subpattern [11]. Thus, we have conducted a comprehensive comparison of these algorithms over two STDBs—first, a synthetic one, then a real one. As part of our experiments, we have also included the Minus-F1 algorithm [13], which has been proven to achieve good results, and a new probabilistic version of it, which we have developed.
An efficient probabilistic algorithm: Although recent developments have produced several off-the-shelf libraries for pattern mining—for instance, apyori [14] is a library that implements the Apriori algorithm in Python—our experiments have confirmed that the performance of the most well-known algorithms is not ideal for STDBs. Thus, we have developed a new, probabilistic version of the Minus-F1 algorithm [13], which we refer to as F1/FP. This new algorithm allows for periodic pattern discovery in STDBs. As in the case of Minus-F1, F1/FP is a Las Vegas algorithm [15], which always provides the correct answer when searching for a pattern, and it has a polynomial behavior coupled with a better performance in STDBs.
Complexity analysis: A calculation of the complexity of the F1/FP algorithm. The complexities of association rule algorithms have not been discussed sufficiently in the literature. Indeed, we have struggled to find sources where this kind of analysis is undertaken. Thus, we have endeavoured to prove that the complexity of our newly proposed algorithm is better than that of the alternatives.
We expect our work to contribute significantly towards future research on pattern searching, especially in the case of the exploration of massive datasets—such as those required for the mining of astronomical data—and online streams which continue to grow uninterruptedly—such as those derived from social media.
The remainder of this paper is organized as follows:
Section 2 consists of a bibliographical review of pattern searching.
Section 3 presents the main algorithms based on association rules, and
Section 4 analyzes the complexity of our proposal.
Section 5 reports on the experimental environment and
Section 6 introduces our results. Lastly,
Section 7 offers our conclusions and comments on future work.
2. Related Work
There are three types of sequential pattern-mining algorithms: machine learning algorithms, algorithms based on mathematical techniques, and algorithms based on association rules. Machine learning algorithms require an objective function and a training dataset to define “correct” patterns [16,17]. This approach often involves complex model selection and hyperparameter tuning, which makes it unsuitable for users who are not well versed in the intricacies of training and tuning machine learning models.
Algorithms based on mathematical techniques involve the utilization of the Fourier transform to calculate the circular autocorrelation [18]. This approach can be tailored to different notions of periodicity. For instance, Khanna and Kasurkar [19] addressed three types of periodicity—symbol periodicity, segment periodicity, and partial periodicity—by proposing corresponding variants of an algorithm based on autocorrelation. Methods based on mathematics are also robust against noise and efficient at extracting partial periodic patterns, without additional domain knowledge. Regrettably, they prioritize computational efficiency by employing approximations, which may miss some periodic patterns [13]. In other words, mathematical methods trade off the guarantee of finding all the qualifying patterns for faster execution times.
Association rule mining algorithms are those derived from the Apriori-based association rule proposed by Agrawal and Srikant [9]. These algorithms exploit the fact that “any superset of an infrequent item set is also infrequent”. Indeed, Apriori identifies frequent item sets from smaller to larger candidates by pruning infrequent ones to prevent an explosion of the number of combinations to be examined.
Even though Apriori remains a well-regarded algorithm [20], it has limitations. First, it only allows for a single minimum support (MS), which can restrict its scope. Second, its efficiency may be lacking in certain situations. To address the first drawback, the MS-Apriori algorithm [10] has been developed to enable the discovery of frequent patterns across multiple thresholds. To address the second drawback, optimization strategies have been used to take advantage of the inherent properties of periodic pattern mining [21,22]. For example, it is not necessary to assess the frequency of an item set in position t if it is not frequent at any position contained within the cycles involving t. Also, other researchers have looked into algorithms that use properties specific to the types of patterns they are interested in, for instance, partial periodic patterns [23], asynchronous periodic patterns [24], symbol periodicity, sequence periodicity, and segment periodicity [25].
Spatio-temporal databases are another area which extends the scope of the problem with many new applications, such as disease diffusion analysis [26], user activity analysis [27], and local trend discovery in social networks [28,29]. Several approaches have been proposed to deal with spatial information [30]: treating it as a continuous variable [31,32], formulating it as a dynamic graph mining problem [33], and encoding spatial features as discrete symbols [13]. We have adopted the discrete symbol encoding approach to fully exploit our former research on sequential periodic pattern mining [13].
Han et al. [11] proposed the Max-Subpattern Hit-Set algorithm, often referred to simply as Max-Subpattern. They based their development on a custom data structure called a max-subpattern tree to efficiently generate larger partial periodic patterns from combinations of smaller patterns. Yang et al. [12] proposed the projection-based partial periodic pattern algorithm (PPA), derived from a strategy to encode events in tuples. The empirical results show that the PPA algorithm is better at discovering partial periodic patterns than Max-Subpattern and Apriori. Han et al. [34] also proposed another algorithm called partial frequent pattern growth (PFP-Growth). PFP-Growth has two stages: the first stage constructs an FP-tree, and the second stage recursively projects the tree to output a complete set of frequent patterns. Experiments were carried out comparing PFP-Growth with the Max-Subpattern algorithm on synthetic data. Results show that PFP-Growth performs better than Max-Subpattern.
Then, Gutiérrez-Soto et al. suggested the Minus-F1 algorithm in 2022 [13]. This is an algorithm designed specifically to search for periodic patterns in STDBs. Gutiérrez-Soto et al. showed that Minus-F1 has a polynomial behavior, which makes it more efficient than other alternatives, such as Apriori, Max-Subpattern, and the PPA. Recently, Gutiérrez-Soto et al. [35] proposed an alternative called HashCycle to find cyclical patterns. Although highly relevant, HashCycle is not appropriate for periodic pattern discovery.
Xun et al. [36] proposed a new pattern called a relevant partial periodic pattern and its corresponding mining algorithm (PMMS-Eclat) to effectively reflect and mine the correlations of multi-source time series data. PMMS-Eclat uses an improved version of Eclat to determine frequent partial periodic patterns and then applies the locality-sensitive hashing (LSH) principle to capture the correlation among these patterns [37].
Jiang et al. [38] addressed the discovery of periodic frequent travel patterns of individual metro passengers considering different time granularities and station attributes. The authors proposed a new pattern called a “periodic frequent passenger traffic pattern with time granularities and station attributes” (PFPTS) and developed a complete mining algorithm with a PFPTS-Tree structure. The proposed algorithm was evaluated on real smart card data collected by an automatic fare collection system in a large metro network. In contrast to Jiang et al., our work can be applied to a variety of situations rather than being restricted to the context of individual travellers.
Whilst existing algorithms have been designed to handle various aspects of periodic pattern mining and spatio-temporal data, they often focus on optimizing computational efficiency or addressing specific pattern types. In contrast, our work presents a novel probabilistic variant of the Minus-F1 algorithm that aims to balance efficiency and effectiveness in a wide range of scenarios. The proposed algorithm is exhaustively evaluated against most of the previously mentioned algorithms using two datasets with diverse characteristics, showcasing its ability to handle different types of periodicity and data distributions. By conducting a comprehensive comparative analysis, we will highlight the unique contributions and advantages of our probabilistic variant of Minus-F1.
3. Algorithms
Sequential pattern mining is concerned with finding statistically relevant data patterns where the values appear in a sequence [
8]. Several algorithms have been designed for this purpose, and we want to compare our newly suggested alternative with the most well-regarded options, namely, Apriori, Max-Subpattern, PPA, Minus-F1, and FP-Growth. We will describe these options below and illustrate our explanations with examples.
3.1. Apriori
Apriori is an algorithm for frequent item mining on relational databases [9]. It identifies items retrieved frequently in a database and creates a set containing such items. Over time, the set becomes larger, as items continue to be added if they are retrieved often. These sets can later be used to establish association rules [39], which highlight trends in the database. Although Apriori was not originally designed to handle a temporal dimension, we have amended it to include one.
Consider the following example. Let us assume that the string below represents a time series with periodicity four—the periodicity has been determined in advance. Note that each character in the string represents a separate event, and the events within curly braces are those that occur simultaneously.
Given that the periodicity of the time series is four, we can confirm that the number of periods is five. We have used hyphens to separate each period in the line below.
Apriori identifies the sets of frequent items by making subsequent passes through the database. In the first pass, it gathers the set of frequent items of size 1; then, in the second pass, the set of frequent items of size 2 and so on.
Let us call F_k the set of frequent items of size k. Then, assuming a minimum support of 3, F_1 can be derived from its candidate items. Subsequently, F_2 can be derived from the candidate pairs of frequent items. Finally, there is only one candidate for F_3. The algorithm finishes when F_k is empty; in this example, we finish with F_3, as the number of events in F_3 cannot generate an F_4 set.
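As a compact illustration of these passes, the sketch below runs an Apriori-style candidate generation over a list of period segments; the segments and the minimum support of 3 are toy values of ours, not the example used above.

```python
from itertools import combinations

def apriori(periods, min_sup):
    """Minimal Apriori over a list of period segments (each a set of events):
    grow frequent itemsets one item at a time, pruning candidates that do not
    reach the minimum support."""
    items = sorted({e for seg in periods for e in seg})
    frequent, k_sets = [], [frozenset([i]) for i in items]
    while k_sets:
        counts = {c: sum(1 for seg in periods if c <= seg) for c in k_sets}
        survivors = [c for c, n in counts.items() if n >= min_sup]
        frequent.extend(survivors)
        # Candidates of size k+1 are unions of surviving size-k sets.
        k_sets = list({a | b for a, b in combinations(survivors, 2)
                       if len(a | b) == len(a) + 1})
    return [set(f) for f in frequent]

periods = [set("abc"), set("abd"), set("ab"), set("acd"), set("abc")]
print(apriori(periods, min_sup=3))
# e.g. [{'a'}, {'b'}, {'c'}, {'a', 'b'}, {'a', 'c'}]
```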
3.2. Max-Subpattern
Max-Subpattern was originally proposed by Han et al. [11] as an attempt to reduce the number of sets needed to determine periodic patterns [40]. It builds as many trees as the number of periods we encounter in a time series representing a sequence of events. However, period 1, which is equivalent to a period formed by a single event, is not taken into account. If a sequence has size n, the maximum number of periods to evaluate is n/2. Thus, Max-Subpattern builds up to n/2 − 1 trees.
Let us call C_max the root of the tree. Then, for each set of candidates F_1, there is a different C_max. Also, each level of the tree holds subpatterns with one fewer event than the level above. For instance, if C_max is formed by four events, the next level in the tree (Level 1) will be formed by four nodes, and each node will represent a subpattern composed of three events. Then, Level 2 is formed by nodes with two events whose ancestor belongs to Level 1. Each node is made up of at least two events, that is, without considering *. Thus, the maximum height of each tree is the number of events in C_max minus two.
Let us consider the same example used for Apriori in Section 3.1. Once F_1 has been determined, C_max is formed. Then, we proceed to find subpattern hits, discarding all the matches with only one non-* element.
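For intuition, the following brute-force sketch enumerates the candidate subpatterns of a maximal pattern C_max by replacing events with * and counts how many periods each one hits. It is our own illustration under simplifying assumptions: it does not build the max-subpattern tree or propagate hits the way Han et al.'s algorithm does, and all names are illustrative.

```python
from itertools import combinations

def subpatterns(c_max, keep):
    """All subpatterns of c_max that keep exactly `keep` non-* positions."""
    positions = [i for i, e in enumerate(c_max) if e != "*"]
    for kept in combinations(positions, keep):
        yield "".join(e if i in kept else "*" for i, e in enumerate(c_max))

def hit_counts(periods, c_max, min_events=2):
    """Count, for every subpattern of c_max with at least `min_events`
    non-* events, how many period segments match it."""
    counts = {}
    for keep in range(len(c_max) - c_max.count("*"), min_events - 1, -1):
        for pat in subpatterns(c_max, keep):
            counts[pat] = sum(
                all(p == "*" or p == s for p, s in zip(pat, seg))
                for seg in periods)
    return counts

periods = ["abc", "abd", "abc", "adc", "abc"]
print(hit_counts(periods, "abc"))
# {'abc': 3, 'ab*': 4, 'a*c': 4, '*bc': 3}
```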
3.3. PPA
After discovering that Max-Subpattern spends a large amount of time calculating frequency counts from redundant candidate nodes, Yang et al. [12] developed the projection-based partial periodic patterns algorithm—abbreviated as PPA—for mining partial periodic patterns with a specific period length in an event sequence.
The PPA starts by going over the time series which represents the sequence of events and splits it into partial periods of size l. Afterwards, each event is codified—that is, the position of each event inside the partial period is recorded. Codified events can be seen as a matrix, where the first row corresponds to the first partial period's codified events and each column corresponds to the event's position inside the partial periods. The matrix was referred to by Yang et al. as an encoded period segment database (EPSD) [12].
By following this approach, it is possible to count the instances of each event by column, and the result is used to check whether the events comply with the required support. Consider Apriori's example defined in Section 3.1. The matrix is defined so that the element x_i corresponds to event x in position i. Once the instances of each event are counted by column, and the minimum support is satisfied, a candidate subsequence can be derived. Then, the events that form this subsequence are sorted, considering first the partial positions and then the lexicographic nomenclature of each event. The resulting subsequence is equivalent to F_1. Indeed, according to Yang et al. [12], F_1 is used to look for the other F_k patterns. Each event of F_1 is used as a prefix to obtain the patterns that comply with the minimum support over the EPSD. Finally, all the F_k sets that fulfil the minimum support are gathered.
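To illustrate the EPSD encoding and the column-wise counting step (only the first phase of the PPA; the prefix-based projection is omitted), consider the following sketch, where the function names and the toy sequence are ours.

```python
from collections import Counter

def build_epsd(sequence, l):
    """Split `sequence` into partial periods of length l and encode each
    event by its position inside the period (the EPSD rows)."""
    return [[(event, pos) for pos, event in enumerate(sequence[i:i + l])]
            for i in range(0, len(sequence) - l + 1, l)]

def frequent_positional_events(epsd, min_sup):
    """Count (event, position) pairs column by column and keep the ones that
    reach the minimum support, sorted by position and then by event."""
    counts = Counter(pair for row in epsd for pair in row)
    return sorted((p for p, n in counts.items() if n >= min_sup),
                  key=lambda p: (p[1], p[0]))

epsd = build_epsd("abcabdabcadc", l=3)
print(frequent_positional_events(epsd, min_sup=3))
# [('a', 0), ('b', 1), ('c', 2)]
```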
3.4. Minus-F1
Minus-F1 operates by using two counters: one which is increased by 1 every time there is a match with the candidate pattern, and a second one which decreases until it reaches zero when the subsequence is consumed. In the first run of the algorithm, the sequence’s probability distribution is calculated—this can be seen as capturing the entropy of all the events in the sequence. To achieve this, Minus-F1 finds out how many times each event occurs. When an event occurs, its counter is decreased. Thus, when the counter reaches zero, we can confirm that it is unnecessary to keep looking for it—it can no longer occur.
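The following minimal sketch illustrates this counting-and-pruning idea for a single candidate pattern; the function name and the exact pruning condition are our own simplifications, not the published Minus-F1 pseudo-code.

```python
from collections import Counter

def count_pattern_with_pruning(sequence, pattern, p):
    """Count period-aligned matches of `pattern` (length p, '*' = any event),
    decrementing a per-event counter as events are consumed so that the scan
    can stop as soon as a required event can no longer occur."""
    remaining = Counter(sequence)                 # occurrences still ahead of the scan
    needed = {e for e in pattern if e != "*"}     # events the pattern cannot do without
    matches = 0
    for i in range(0, len(sequence) - p + 1, p):
        segment = sequence[i:i + p]
        if all(a == "*" or a == b for a, b in zip(pattern, segment)):
            matches += 1
        for e in segment:                         # consume this period's events
            remaining[e] -= 1
        if any(remaining[e] <= 0 for e in needed):
            break                                 # a required event is exhausted: prune
    return matches

# 'ab*' can no longer occur once the last 'a' has been consumed, so the scan
# stops long before the end of the sequence.
print(count_pattern_with_pruning("abcabdabe" + "xyz" * 20, "ab*", p=3))  # 3
```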
The worst-case scenario for Minus-F1 happens when the events are distributed uniformly [13]. In contrast, when the distribution is not uniform, the algorithm performs the pruning efficiently. To illustrate this, consider a sequence S composed of four subsequences, all with period 3. Assuming a minimum support of 2, only two of these subsequences satisfy the minimum support and form a partial pattern; in other words, there is a single partial pattern. Once the two matching subsequences have been consumed, it makes no sense to continue searching for them—in our example, the events a and b cannot occur in the remaining subsequences. Hence, we can prune the search space.
It appears that Minus-F1 is mainly affected by the size of the period [13], as opposed to the number of patterns found, which differs from the rest of the algorithms reviewed here. In fact, Minus-F1 goes through the entire sequence of events once for each period under consideration. Thus, Gutiérrez-Soto et al. [13] have pointed out that Minus-F1's best performance is achieved as the minimum support tends to zero. We aimed to address this issue in the new algorithm that we are proposing.
3.5. FP-Growth
FP-Growth was designed to derive sets of frequent items from sequences without a pre-defined period. The algorithm begins by creating a table comprising the frequent items which satisfy the minimum support; the table is then sorted in descending order of frequency.
Consider again the items from the example in Section 3.1 which satisfy the minimum support. FP-Growth removes from the segments the items that do not satisfy the minimum support and keeps the remaining ones to search for partial patterns. Finally, the patterns are sorted according to the position they have in the original segments. In the case of our example, the results are displayed in Table 1.
3.6. Minus-F1’s (Probabilistic Version)
Our version of Minus-F1, which we have called F1/FP, is a Las Vegas type of algorithm, which always provides the correct answer. This means that its performance in the worst-case scenario corresponds to the deterministic algorithm’s performance. Note that this situation arises only when the probability distribution of the algorithm’s input data reaches the worst case. Although this is uncommon, it depends on the probability distribution. Therefore, the time complexities for this type of algorithm are expressed as expected time, denoted by .
F1/FP operates similarly to Minus-F1, except that, when searching for subsequences, these are selected randomly, assuming their occurrence likelihood follows a uniform distribution. This can be seen in Line 9—the swap procedure—of Algorithm 1, where we have listed the pseudo-code for F1/FP to illustrate our explanation. This simple modification of Minus-F1 provides a better performance. It is worth noting that the literature has plenty of such subtle improvements, which result in better performances and running times.
Algorithm 1. Probabilistic Minus-F1 (pseudo-code): nested loops over the periods and the subsequences of the input sequence, with the random swap procedure at Line 9.
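To make the idea concrete, the following Python sketch mirrors the spirit of Algorithm 1 under our own simplifications: candidate events are visited in a randomly swapped order (a uniform distribution, as in the Line 9 swap), and a candidate is abandoned as soon as its remaining occurrences can no longer reach the minimum support. The function names and the exact pruning rule are ours, not the published pseudo-code.

```python
import random

def f1_fp_sketch(sequence, p, min_sup):
    """Illustrative version of the F1/FP idea: length-1 candidates (event,
    position within the period) are examined in a random order produced by
    uniform swaps, and each candidate is pruned as soon as it can no longer
    reach the minimum support."""
    segments = [sequence[i:i + p] for i in range(0, len(sequence) - p + 1, p)]
    candidates = sorted({(e, pos) for seg in segments for pos, e in enumerate(seg)})
    for k in range(len(candidates)):              # Line-9-style uniform swap
        j = random.randint(k, len(candidates) - 1)
        candidates[k], candidates[j] = candidates[j], candidates[k]
    frequent = []
    for event, pos in candidates:
        count, left = 0, len(segments)
        for seg in segments:
            count += (seg[pos] == event)
            left -= 1
            if count + left < min_sup:            # support can no longer be reached
                break
        if count >= min_sup:
            frequent.append((event, pos))
    return sorted(frequent)

print(f1_fp_sketch("abcabdabcadc", p=3, min_sup=3))  # [('a', 0), ('b', 1), ('c', 2)]
```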
4. Time Complexity
To show how random swaps affect the running time, which we characterize by its expected value, we provide the following definitions:
Definition 2. Let f(X_i) be a function determining the occurrence of subsequence X_i within the sequence S, such that f(X_i) = 1 when X_i occurs in S and f(X_i) = 0 otherwise.
Definition 3. Let P(X_i) be the probability of choosing some subsequence X_i within sequence S, such that its position i can be between 1 and m, where m is the number of subsequences available to carry out a swap (1 ≤ i ≤ m).
We assume that all subsequences have the same probability of being selected—in other words, a uniform distribution is assumed. Thus, P(X_i) is defined as P(X_i) = 1/m.
Definition 4. Given a random subsequence X_i, whose position within S is i, the expected value to carry out a swap is defined as E[X] = Σ_{i=1}^{m} i · P(X_i).
Lemma 1. The number of swaps carried out by the probabilistic version of Minus-F1 is given by the number of subsequences, m. Therefore, the time complexity to carry out a swap is given by its expected value.
Proof by Induction. Base case (m = 2): using the loop invariant in Lines 2–3 of Algorithm 1, we notice that there is exactly one swap. Note that S has two events; the random event is chosen from the first event of the sequence, so the expected value of Definition 4 holds trivially.
Inductive step: assume the result holds for any k-iteration from 2 up to m − 1. Using the loop invariant in Lines 2–3, there are always k swaps after k iterations; hence, after m iterations there are m swaps, which matches the number of subsequences stated in Lemma 1.
Although our random procedure provides notable improvements in running times, its time complexity does not change in general. This is because the sequence's length is n, and the algorithm must run through all the subsequences over the p periods. Consequently, running through all the subsequences takes time proportional to n. Without loss of generality, and given that all of the algorithm's loops operate on subsequences chosen randomly, a portion of this version can be expressed through its expected value rather than a deterministic count, except for the loops between Lines 2 and 8—note that these loops are related to the m events. Thus, since Minus-F1's time complexity is polynomial, this probabilistic version can be characterized by its expected running time, which is bounded by the complexity of the deterministic Minus-F1. □
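As a worked illustration of Definition 4 under the uniform-distribution assumption (our own step, using the notation introduced above), the expected position at which a swap takes place is:

```latex
\[
  E[X] \;=\; \sum_{i=1}^{m} i \cdot P(X_i)
       \;=\; \frac{1}{m} \sum_{i=1}^{m} i
       \;=\; \frac{m+1}{2}.
\]
```

That is, under the uniform assumption, a randomly chosen subsequence lies on average halfway through the m available positions.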
5. Experimentation
To check the algorithms' performance, two datasets were used. The first one is composed of synthetic data, and it was used to corroborate that each algorithm was implemented correctly—that is, to confirm that each algorithm was able to find the required patterns. Once correctness had been verified, we used a second dataset to confirm that the algorithms could handle real data. The second dataset is a sample of the Geolife GPS trajectory dataset [41].
Geolife records a broad range of users' outdoor movements, including daily routines—going to work or returning home—and activities like travelling to entertainment, shopping, and sport activities [41]. Geolife has been widely used in mobility pattern mining and location-based social networks [26], which are potential applications for our work. Therefore, we thought this dataset would fit our experimentation adequately.
The Geolife dataset comprises GPS trajectories undertaken by 182 people over a period of three years—between April 2007 and August 2012—and it was collected by Microsoft Research Asia. Each GPS trajectory is represented by a sequence of time-stamped points labelled by latitude, longitude, and altitude.
To characterize Geolife as an STDB for our experiments, the space was represented by a set of cells forming a grid. The location of each object within the grid was determined by its latitude and longitude. Time was modelled as a timestamp. At timestamp 0, all the objects are situated in their initial positions. Subsequently, objects move to different positions across the grid. An object’s motion was characterized as a contiguous sequence of characters, facilitating pattern searching within the sequence.
It should be observed that our representation of motion can have an impact on pattern detection only if movement occurs within a time window whose granularity is smaller than what has been represented. For instance, if we were measuring time in minutes, we could lose some patterns occurring within seconds. However, this is not the case. The efficiency of the algorithms considered here does not depend on the granularity of the grid, but on the length of the sequence.
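As an illustration of this representation, the following sketch maps time-ordered GPS points onto a grid of equal-sized cells and encodes the resulting cell visits as a character string; the origin, the cell size, and the symbol assignment are arbitrary choices of ours, not the exact parameters used in our experiments.

```python
import string

def cell_of(lat, lon, lat0, lon0, cell_deg):
    """Map a GPS point to a (row, col) grid cell of size `cell_deg` degrees."""
    return int((lat - lat0) // cell_deg), int((lon - lon0) // cell_deg)

def encode_trajectory(points, lat0, lon0, cell_deg, alphabet=string.ascii_letters):
    """Turn a time-ordered list of (lat, lon) points into a character sequence:
    each distinct visited cell is assigned one symbol, so the object's motion
    becomes a string that the pattern-mining algorithms can scan."""
    symbols, out = {}, []
    for lat, lon in points:                      # points are already sorted by timestamp
        cell = cell_of(lat, lon, lat0, lon0, cell_deg)
        if cell not in symbols:
            symbols[cell] = alphabet[len(symbols) % len(alphabet)]
        out.append(symbols[cell])
    return "".join(out)

# Toy trajectory around an origin of (39.90, 116.30) with 0.01-degree cells.
traj = [(39.905, 116.305), (39.905, 116.306), (39.915, 116.305), (39.905, 116.305)]
print(encode_trajectory(traj, 39.90, 116.30, 0.01))   # "aaba"
```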
The results displayed below correspond to the average of five executions for each experiment. From an empirical perspective, the performance of each algorithm is determined by its running time. To define a pattern, a range of 2 to n/2 events was considered, where n represents the length of the sequence. This implies that all patterns consist of at least two events—at least one event repetition—occurring up to half the length of the sequence. For a pattern to be valid, it must occur at least twice within the sequence.
All experiments were limited to a maximum of 3 h—results exceeding this length are not shown. The experiments were carried out on a server equipped with an Intel Xeon Processor E3-1220 at 3.00 GHz and 16 GB of RAM operating at 2133 MHz with a 1 TB 7200 RPM hard drive, and running under Linux (Debian Jessie 8.4).
Table 2 displays the abbreviations used later in our results to refer to the different algorithms.
5.1. Results Derived from the Synthetic Dataset
The experiments contemplated sequences of sizes 500, 750, and 1000, considering periods of 4, 8, 16, and 20. To link the running times with the corresponding computational complexities of each algorithm, two experiments were performed. The experiments cover pattern searching over the synthetic database, which has a full pattern with period 48 that is repeated until the sequence size is reached. Supports of 25% (Table 3, Table 4 and Table 5), 50% (Table 6, Table 7 and Table 8), and 75% (Table 9, Table 10 and Table 11) were considered.
5.2. Results Derived from the Geolife Dataset
We chose data samples from three Geolife users—Users 0, 1, and 2—and we reviewed these samples manually to confirm the presence of periodic patterns. Different sequence sizes—500, 750, and 1000—and periods—10, 15, 25, 50, and 100—were considered. The 500 dataset was built from User 0's records, the 750 dataset was built from User 1's records, and the 1000 dataset was built from User 2's records. Supports of 25% (Table 12, Table 13 and Table 14), 50% (Table 15, Table 16 and Table 17), and 75% (Table 18, Table 19 and Table 20) were considered.
6. Discussion
As mentioned previously, timestamped data on the grid are mapped to a character string representing a sequence. All the algorithms which we have included in our research operate on such sequences, and both their performance and scalability depend solely on the sequence's length and the minimum support. Our results are independent of the size of the grid and the configuration of its cells. Thus, the impact of the mapping is not considered here. There is, on the other hand, a separate body of research that studies indexing and searching methods in spatio-temporal databases. These works are based on indexing structures such as the r-tree and its variants [42]—namely, the HR-tree and the MVR-tree, to mention a couple. Given that such works depend on these structures, it is not possible to compare them directly with the association rule algorithms that we have described here.
6.1. Synthetic Data Results
Table 3, Table 4 and Table 5 show the experimental results over the synthetic STDB. In these tables, the minimum support was set to 25%. The sequence length in Table 3 is 500, in Table 4 it is 750, and in Table 5 it is 1000.
Even though our experiments were limited to a maximum of three hours, they shed light on the algorithms' performance. In Table 3, the best results for average processing time are provided by F1/FP (6.4 ms), followed by F1 (26.6 ms), M-SP, FP-G, and PPA. These results are in line with the standard deviations presented by the first two algorithms—1.85 for F1/FP and 17.02 for F1. The worst results were produced by APR and MSA. These two algorithms also had the worst standard deviations—2.32 for APR, with an average time of 1.74 ms, and 9.20 for MSA, with an average time of 9.20 ms.
Table 4 reflects the same behavior as Table 3, maintaining the same ranking of the least and most efficient algorithms in terms of processing time. According to Table 5, it is possible to see the same performance trend for both the least and most efficient algorithms. Whenever the sequence length was increased with respect to Table 3 and Table 4, the processing times also increased for all the algorithms.
Table 6, Table 7 and Table 8 present the results considering a minimum support of 50%. In Table 6 the sequence length is 500, whereas in Table 7 it is 750, and in Table 8 it is 1000. In Table 6, the worst performance is by MSA, with an average time of 1.0 ms and a standard deviation of 3.99, followed by PPA, whose average time was 1.77 ms with a standard deviation of 3.52. Conversely, the best times are provided by F1/FP and F1. F1/FP has an average of 4.8 ms and a standard deviation of 1.16, while F1 has an average of 20.6 ms with a standard deviation of 13.23.
Table 7 exhibits the same behavior as Table 6—that is, the same order of performance for the two most efficient and the two least efficient algorithms. The worst average time in Table 8 was recorded by PPA (2.82 ms), while its standard deviation was 6.79. The second worst average—2.82—was recorded by M-SP, while the second worst standard deviation—4.79—was presented by MSA.
It is worth noting that the PPA is particularly affected when the period is 12 in Table 6, Table 7 and Table 8, as both its average time and standard deviation increase. However, the PPA is not the only one affected. All algorithms are impacted negatively by the same period, except for F1/FP and F1. This peculiarity with period 12 could be attributed to how the pattern is formed, as both APR and MSA are not affected as much as the PPA. Following the same trend observed in Table 3, Table 4 and Table 5, the best average times along with the best standard deviations are yielded by F1/FP and F1.
Table 9, Table 10 and Table 11 display the results considering a minimum support of 75%, with sequences of lengths 500 (Table 9), 750 (Table 10), and 1000 (Table 11). In Table 9, the worst average time was yielded by M-SP with 442.2 ms, while the second worst was from APR—41.6 ms with a standard deviation of 24.62. Remarkably, the standard deviation for F1 (13.141) was the second worst. The most efficient algorithm was the PPA, with an average of 3.8 ms and a standard deviation of 1.30. The second most efficient one was MSA, whose average time was 4 ms. Note that FP-G registered the lowest standard deviation, with a value of 0.707.
As in the case of Table 9, the least efficient algorithms in Table 10 are M-SP, with an average of 1.22 ms and a standard deviation of 44.7, and APR, which presents an average time of 46.8 ms with a standard deviation of 27.36. FP-G exhibits the lowest standard deviation, and the PPA proves to be the most efficient with an average of 4.4 ms and the second-lowest standard deviation of 1.14. F1/FP offers the second-best average—that is, 4.8 ms—and the third-best standard deviation of 1.30. Notably, F1 continues to have a better average time and standard deviation than APR and M-SP.
Finally, Table 11 shows the same trend as Table 9 and Table 10. The highest standard deviation was displayed by M-SP at 2.78, and the lowest average time was exhibited by F1/FP at 5.4 ms, followed by the PPA at 5.6 ms. Note that F1/FP presents a high standard deviation, though it is negligible in comparison with the PPA's, and F1 has better averages than M-SP and APR.
To summarise, from this set of experiments, we can appreciate that every time the sequence length is increased, the processing time also increases. In addition, whenever the support is raised, all algorithms tend to reduce their average time and their standard deviation, which implies lower processing times for each of them. PPA, M-SP, and FP-G greatly benefit from the minimum support being increased. On the other hand, F1/FP and F1 exhibit a scalable performance that is independent of both increases—in the minimum support and in the sequence length. This is particularly notable in comparison with the performance of the other algorithms, especially when the support is low.
6.2. Real Data Results
Table 12, Table 13 and Table 14 display experimental results on the real dataset. In these three tables, the minimum support was set to 25%, and the sequence lengths are 500, 750, and 1000, respectively. In Table 12, three algorithms exceed the maximum of three hours, particularly when the periods are 50 and 100. These algorithms correspond to APR, MSA, and PPA, which present the highest average times along with their corresponding standard deviations—that is, replacing “-” with three hours in milliseconds. According to the results of this table, the most efficient algorithms are F1/FP, with an average of 19.6 ms, and F1, with 454 ms. Their standard deviations were 12.11 and 577.23, respectively.
Table 13 exhibits the same behavior as Table 12, maintaining the same positions for the least efficient algorithms in terms of running time, particularly when the period is 100. Following the same pattern as Table 12, F1/FP and F1 had the lowest averages and standard deviations. Continuing this trend, Table 14 provides the same rankings for the best and worst averages along with their standard deviations.
Four algorithms—APR, MSA, PPA, and FP-G—exceeded the time limit in Table 15. These four algorithms yielded the highest standard deviations. Conversely, the lowest averages and standard deviations corresponded to F1/FP and F1. No algorithm exceeded the time limit in Table 16. However, the highest averages were provided by APR (3.25 ms with a standard deviation of 6.04), followed by M-SP with an average of 1.60 and a standard deviation of 6.20.
Two algorithms obtained the lowest averages: PPA, with 22 ms and a standard deviation of 7, followed by F1/FP, with 25.6 ms. As for Table 17, three algorithms exceeded the time limit—APR, MSA, and PPA—when the period was 100. Also, note that these algorithms presented the highest standard deviations. The algorithm with the lowest average was FP-G—34 ms with a standard deviation of 10.99—followed by F1/FP—41.2 ms and a standard deviation of 31.956.
Table 18 is no exception to the fact that some algorithms exceeded the time limit, namely APR, MSA, and PPA, specifically when the period was 100. The lowest averages were given by F1/FP—15.4 ms with a standard deviation of 11.41—and F1—192.6 ms with a standard deviation of 221.14. As for Table 19, no algorithm exceeded 3 h of processing. The highest average times were given by APR, with 2.83 ms and a standard deviation of 5.16. The second-highest times corresponded to M-SP, with an average of 1.54 ms and a standard deviation of 568.33.
Finally, Table 20 shows that no algorithm exceeded the maximum time limit. The lowest average time was provided by PPA—27 ms with a standard deviation of 6.782. The second-best average time was for FP-G—32 ms with a standard deviation of 7.211.
As with the synthetic dataset, every time the support was increased in the real dataset, the running times decreased, except for F1/FP and F1. Similarly, when the sequence length was increased, the running times also increased.
At first glance, the running times are higher on the real dataset than on the synthetic one. However, for both datasets, the running times of the algorithms that use a minimum support decreased every time the minimum support was increased. F1/FP always showed remarkable running times; indeed, this algorithm was always among the best ones. Incidentally, when the minimum support was increased, PPA and M-SP also achieved good results on the real dataset.
7. Conclusions
Mining periodic patterns became a topic of relevance in the 1990s, mostly after the development of the Apriori algorithm. Since then, the discovery of patterns has turned out to be one of the main techniques for characterizing data. Over the years, several improvements to the basic Apriori idea have been considered, focusing on larger and larger datasets as time has progressed, increasingly stressing the storage and processing capabilities of modern computers.
In this paper, we have presented F1/FP, a new probabilistic algorithm, which is guaranteed to find all the periodic patterns—it always returns the correct answer, as any Las Vegas algorithm. F1/FP does not require minimum support and is scalable.
Given that no previous work has compared the performance of the most well-regarded algorithms in this field, we endeavored to compare them through extensive experimentation, involving sequences of different lengths and various support thresholds. This has enabled us to gain a broader understanding of the performance of the most relevant algorithms, and we have confirmed that our proposal performs better than the existing alternatives. Our experiments allow us to derive the following observations:
F1/FP Performance: F1/FP has a robust performance not only on synthetic data but also on real data. F1/FP provides the best average results compared to all the other algorithms included in our study.
Apriori: The Apriori algorithm achieves the worst results for synthetic and real datasets.
PPA: When support is increased for real data, the PPA has a reasonably good performance. When support is low and data are synthetic, the PPA is not ideal.
MS-Apriori: The performance of MS-Apriori is remarkably good on synthetic data.
FP-Growth: The FP-Growth algorithm achieves a better performance than the PPA when support is increased.
Minimum support: Every time the minimum support is increased, all algorithms accomplish better processing times.
Performance on real data: Broadly speaking, all algorithms increase their average times when using real data.
Future Work
Although F1/FP always returns the correct answer and is scalable, we must continue to work on its runtime execution to ensure that it is as fast as possible. To improve the runtime in future research, we plan to exploit parallelism, including what we can obtain through a MapReduce formulation [43].
We also want to consider an extension of our work to manage continuous online streams. While our algorithm is guaranteed to handle such a challenge, it would be an interesting case study to utilize it to detect and classify events in real time as they are retrieved from social media and astronomical data streams.