Mining Sequential Patterns with VC-Dimension and Rademacher Complexity

Abstract: Sequential pattern mining is a fundamental data mining task with applications in several domains. We study two variants of this task: the first is the extraction of frequent sequential patterns, whose frequency in a dataset of sequential transactions is higher than a user-provided threshold; the second is the mining of true frequent sequential patterns, which appear with probability above a user-defined threshold in transactions drawn from the generative process underlying the data. We present the first sampling-based algorithm to mine, with high confidence, a rigorous approximation of the frequent sequential patterns from massive datasets. We also present the first algorithms to mine approximations of the true frequent sequential patterns with rigorous guarantees on the quality of the output. Our algorithms are based on novel applications of Vapnik-Chervonenkis dimension and Rademacher complexity, advanced tools from statistical learning theory, to sequential pattern mining. Our extensive experimental evaluation shows that our algorithms provide high-quality approximations for both problems we consider. In particular, our sampling algorithm for mining frequent sequential patterns produces accurate approximations using samples that are small fractions of the whole datasets, thus vastly speeding up the sequential pattern mining task on very large datasets. For mining true frequent sequential patterns, our algorithms obtain high-quality approximations, even better than guaranteed by their theoretical analysis.
In addition, our evaluation shows that the upper bound on the maximum deviation computed from the approximation of the Rademacher complexity yields better results than the upper bound computed from the empirical VC-dimension.


Introduction
Sequential pattern mining [1] is a fundamental task in data mining and knowledge discovery, with applications in several fields, from recommender systems and e-commerce to biology and medicine. In its original formulation, sequential pattern mining requires identifying all frequent sequential patterns, that is, sequences of itemsets that appear in a fraction at least θ of all the transactions in a transactional dataset, where each transaction is a sequence of itemsets. The threshold θ is a user-specified parameter, and its choice must be informed, at least in part, by domain knowledge. In general, sequential patterns describe sequences of events or actions that are useful for predictions in many scenarios.
Several exact methods have been proposed to find frequent sequential patterns. However, the exact solution of the problem requires processing the entire dataset at least once, and often multiple times. For the large datasets that are common today, this may be infeasible. A natural way to reduce the computation is to use sampling to obtain a small random portion (sample) of the dataset, and to perform the mining process only on the sample. By analyzing only a sample of the data, the problem cannot be solved exactly, and one has to rely on the approximation provided by the results of the mining task on the sample. Therefore, the main challenge in using sampling is computing a sample size such that the frequency of the sequential patterns in the sample is close to the frequency that would be obtained from the analysis of the whole dataset. Relating the two quantities using standard techniques (e.g., Hoeffding's inequality and union bounds) does not provide useful results, that is, small sample sizes. In fact, such procedures require knowing the number of all the sequential patterns in the dataset, which is impractical to compute in a reasonable time. So, one has to resort to loose upper bounds that usually result in sample sizes that are larger than the whole dataset. Recently, tools from statistical learning theory (e.g., Vapnik-Chervonenkis dimension [2] and Rademacher complexity [3]) have been successfully used in frequent itemset mining [4,5], a frequent pattern mining task where transactions are collections of items, showing that accurate and rigorous approximations can be obtained from small samples of the entire dataset. While sampling has previously been used in the context of sequential pattern mining (e.g., Reference [6]), to the best of our knowledge no sampling algorithm providing a rigorous approximation of the frequent sequential patterns has been proposed.
In several applications, the analysis of a dataset is performed to gain insight into the underlying generative process of the data. For example, in market basket analysis one is interested in gaining knowledge on the behaviour of all the customers, which can be modelled as a generative process from which the transactions in the dataset have been drawn. In such a scenario, one is not interested in sequential patterns that are frequent in the dataset, but in sequential patterns that are frequent in the generative process, that is, whose probability of appearing in a transaction generated from the process is above a threshold θ. Such patterns, called true frequent patterns, have been introduced by Reference [7], which provides a Vapnik-Chervonenkis (VC) dimension based approach to mine true frequent itemsets. While there is a relation between the probability that a pattern appears in a transaction generated from the process and its frequency in the dataset, one cannot simply look at patterns with frequency above θ in the dataset to find the ones with probability above θ in the process. Moreover, due to the stochastic nature of the data, one cannot identify the true frequent patterns with certainty, and approximations are to be sought. In such a scenario, relating the probability that a pattern appears in a transaction generated from the process with its frequency in the dataset using standard techniques is even more challenging. Hoeffding's inequality and union bounds require bounding the number of all the possible sequential patterns that can be generated from the process. Such a bound is infinite if one considers all possible sequential patterns (e.g., if one does not bound the pattern length). To the best of our knowledge, no method to mine true frequent sequential patterns has been proposed.

Our Contributions
In this work, we study two problems in sequential pattern mining: mining frequent sequential patterns and mining true frequent sequential patterns. We propose efficient algorithms for these problems, based on the concepts of VC-dimension and Rademacher complexity. In this regard, our contributions are:
• We define rigorous approximations of the set of frequent sequential patterns and the set of true frequent sequential patterns. In particular, for both sets we define two approximations: one with no false negatives, that is, containing all elements of the set; and one with no false positives, that is, without any element that is not in the set. Our approximations are defined in terms of a single parameter, which controls the accuracy of the approximation and is easily interpretable.
• We study the VC-dimension and the Rademacher complexity of sequential patterns, two advanced concepts from statistical learning theory that have been used in other mining contexts, and provide algorithms to efficiently compute upper bounds for both. In particular, we provide a simple, but still effective in practice, upper bound to the VC-dimension of sequential patterns by relaxing the upper bound previously defined in Reference [8]. We also provide the first efficiently computable upper bound to the Rademacher complexity of sequential patterns, and we show how to approximate the Rademacher complexity of sequential patterns.
• We introduce a new sampling-based algorithm to identify rigorous approximations of the frequent sequential patterns with probability 1 − δ, where δ is a confidence parameter set by the user. Our algorithm hinges on our novel bound on the VC-dimension of sequential patterns, and it obtains a rigorous approximation of the frequent sequential patterns by mining only a fraction of the whole dataset.
• We introduce efficient algorithms to obtain rigorous approximations of the true frequent sequential patterns with probability 1 − δ, where δ is a confidence parameter set by the user. Our algorithms use the novel bounds on the VC-dimension and on the Rademacher complexity that we have derived, and they obtain accurate approximations of the true frequent sequential patterns, where the accuracy depends on the size of the available data.
• We perform an extensive experimental evaluation analyzing several sequential datasets, showing that our algorithms provide high-quality approximations, even better than guaranteed by their theoretical analysis, for both tasks we consider.

Related Work
Since the introduction of the frequent sequential pattern mining problem [1], a number of exact algorithms have been proposed for this task, ranging from multi-pass algorithms using the anti-monotonicity property of the frequency function [9], to prefix-based approaches [10], to works focusing on the closed frequent sequences [11].
The use of sampling to reduce the amount of data for the mining process while obtaining rigorous approximations of the collection of interesting patterns has been successfully applied in many mining tasks. Raïssi and Poncelet [6] provided a theoretical bound on the sample size for a single sequential pattern in a static dataset using Hoeffding concentration inequalities, and they introduced a sampling approach to build a dynamic sample in a streaming scenario using a biased reservoir sampling. Our work is heavily inspired by the work of Riondato and Upfal [4,5], which introduced advanced statistical learning techniques for the task of frequent itemsets and association rules mining. In particular, in Reference [4] they employed the concept of VC-dimension to derive a bound on the sample size needed to obtain an approximation of the frequent itemsets and association rules from a dataset, while in Reference [5] they proposed a progressive sampling approach based on an efficiently computable upper bound on the Rademacher complexity of itemsets. VC-dimension has also been used to approximate frequent substrings in collections of strings [12], and the related concept of pseudo-dimension has been used to mine interesting subgroups [13]. Rademacher complexity has also been used in graph mining [14][15][16] and to design random sampling approaches for estimating betweenness centralities in graphs [17].
Other works have studied the problem of approximating frequent sequential patterns using approaches other than sampling. In Reference [18], the dataset is processed in blocks with a streaming algorithm, but the intermediate sequential patterns returned may miss many frequent sequential patterns. More recently, Reference [8] introduced an algorithm to process the datasets in blocks using a variable, data-dependent frequency threshold, based on an upper bound to the empirical VC-dimension, to mine each block. Reference [8] defines an approximation for frequent sequential patterns that is one of the definitions we consider in this work. The intermediate results obtained after analyzing each block have probabilistic approximation guarantees, and after analyzing all blocks the output is the exact collection of frequent sequential patterns. While these works, in particular Reference [8], are related to our contributions, they do not provide sampling algorithms for sequential pattern mining.
To the best of our knowledge, Reference [7] is the only work that considers the extraction of frequent patterns w.r.t. an underlying generative process, based on the concept of empirical VC-dimension of itemsets. While we use the general framework introduced by Reference [7], the solution proposed there requires solving an optimization problem that is tailored to itemsets and, thus, not applicable to sequential patterns; in addition, computing the solution of such a problem could be relatively expensive. Reference [19] considers the problem of mining significant patterns under a similar framework, making more realistic assumptions on the underlying generative process compared to commonly used tests (e.g., Fisher's exact test).
Several works have been proposed to identify statistically significant patterns, where the significance is defined in terms of the comparison of pattern statistics. A few methods [20][21][22] have been proposed to mine statistically significant sequential patterns. These methods are orthogonal to our approach, which focuses on finding sequential patterns that are frequent with respect to (w.r.t.) an underlying generative distribution.

Preliminaries
We now provide the definitions and concepts used throughout the article. We start by introducing the task of sequential pattern mining and formally define the two problems which are the focus of this work: approximating the frequent sequential patterns and mining sequential patterns that are frequently generated from the underlying generative process. We then introduce two tools from statistical learning theory, that is, the VC-dimension and the Rademacher complexity, and the related concept of maximum deviation.

Sequential Pattern Mining
Let $I = \{i_1, i_2, \ldots, i_h\}$ be a finite set of elements called items. $I$ is also called the ground set. An itemset $P$ is a non-empty subset of $I$, that is, $P \subseteq I$. A sequential pattern $p = \langle P_1, P_2, \ldots, P_\ell \rangle$ is a finite ordered sequence of itemsets, with $P_i \subseteq I$, $1 \le i \le \ell$. A sequential pattern $p$ is also called a sequence. The length $|p|$ of $p$ is defined as the number of itemsets in $p$. The item-length $\|p\|$ of $p$ is the sum of the sizes of the itemsets in $p$, that is,

$$\|p\| = \sum_{i=1}^{|p|} |P_i|.$$

A sequence $a = \langle A_1, \ldots, A_\ell \rangle$ is a subsequence of a sequence $b = \langle B_1, \ldots, B_m \rangle$, denoted by $a \sqsubseteq b$, if and only if there exist integers $1 \le r_1 < r_2 < \cdots < r_\ell \le m$ such that $A_j \subseteq B_{r_j}$ for every $1 \le j \le \ell$. In that case, $b$ is a super-sequence of $a$.

Let $U$ denote the set of all the sequences which can be built with itemsets containing items from $I$. A dataset $D$ is a finite bag of (sequential) transactions where each transaction is a sequence from $U$. A sequence $p$ belongs to a transaction $\tau \in D$ if and only if $p \sqsubseteq \tau$. For any sequence $p$, the support set $T_D(p)$ of $p$ in $D$ is the set of transactions in $D$ to which $p$ belongs: $T_D(p) = \{\tau \in D : p \sqsubseteq \tau\}$. The support $Supp_D(p)$ of $p$ in $D$ is the cardinality of the set $T_D(p)$, that is, the number of transactions in $D$ to which $p$ belongs: $Supp_D(p) = |T_D(p)|$. Finally, the frequency $f_D(p)$ of $p$ in $D$ is the fraction of transactions in $D$ to which $p$ belongs:

$$f_D(p) = \frac{Supp_D(p)}{|D|}.$$

A sequence $p$ is closed w.r.t. $D$ if for each of its proper super-sequences $y \sqsupset p$ we have $f_D(y) < f_D(p)$, or, equivalently, none of its proper super-sequences has support equal to $Supp_D(p)$. We denote the set of all closed sequences in $D$ with $CS(D)$.

Example 1. Consider the following dataset $D = \{\tau_1, \tau_2, \tau_3, \tau_4\}$ as an example: [...] The dataset above has 4 transactions. The first one, $\tau_1$ [...] is not a subsequence of $\tau_1$ because the order of the itemsets in the two sequences is not the same. Note that from the definitions above, an item can only occur once in an itemset, but it can occur multiple times in different itemsets of the same sequence. Finally, the sequence $\langle \{6, 7\}, \{5\} \rangle$, whose frequency is 3/4, is a closed sequence, since its frequency is higher than the frequency of each of its super-sequences.

Given a dataset $D$ and a minimum frequency threshold $\theta \in (0, 1]$, frequent sequential pattern (FSP) mining is the task of reporting the set $FSP(D, \theta)$ of all the sequences whose frequency in $D$ is at least $\theta$, and their frequencies:

$$FSP(D, \theta) = \{(p, f_D(p)) : p \in U, f_D(p) \ge \theta\}.$$

In the first part of this work, we are interested in finding the set $FSP(D, \theta)$ by only mining a sample of the dataset $D$. Note that given a sample of the dataset $D$, one cannot guarantee to find the exact set $FSP(D, \theta)$ and has to resort to approximations of $FSP(D, \theta)$. Thus, we are interested in finding rigorous approximations of $FSP(D, \theta)$. In particular, we consider the approximation of $FSP(D, \theta)$ defined in Reference [8].
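As an illustration of the definitions above, the following minimal Python sketch implements the subsequence test, the support, the frequency, and the item-length. It assumes that a sequence is represented as a tuple of frozensets; the helper names (`seq`, `is_subsequence`, `support`, `frequency`, `item_length`) and the toy dataset are hypothetical and are not taken from the original work (in particular, the toy dataset is not the dataset of Example 1).

```python
from typing import FrozenSet, Iterable, List, Tuple

Itemset = FrozenSet[int]
Seq = Tuple[Itemset, ...]


def seq(*itemsets: Iterable[int]) -> Seq:
    """Build a sequence (tuple of frozensets) from plain Python sets."""
    return tuple(frozenset(s) for s in itemsets)


def is_subsequence(p: Seq, tau: Seq) -> bool:
    """Return True if p is a subsequence of tau.

    Greedy matching: each itemset of p is mapped to the earliest itemset
    of tau (after the previous match) that contains it.
    """
    j = 0
    for itemset in p:
        while j < len(tau) and not itemset <= tau[j]:
            j += 1
        if j == len(tau):
            return False
        j += 1  # the next itemset of p must match strictly later in tau
    return True


def support(p: Seq, dataset: List[Seq]) -> int:
    """Number of transactions of the dataset to which p belongs."""
    return sum(1 for tau in dataset if is_subsequence(p, tau))


def frequency(p: Seq, dataset: List[Seq]) -> float:
    """Fraction of transactions of the dataset to which p belongs."""
    return support(p, dataset) / len(dataset)


def item_length(p: Seq) -> int:
    """Sum of the sizes of the itemsets of p."""
    return sum(len(itemset) for itemset in p)


# Hypothetical toy dataset (an assumption made for this sketch only).
D = [
    seq({1, 2}, {3}, {6, 7}, {5}),
    seq({6, 7}, {5}),
    seq({6, 7}, {1}, {5}),
    seq({5}, {6, 7}),
]
p = seq({6, 7}, {5})
print(support(p, D), frequency(p, D), item_length(D[0]))
```

The greedy strategy is sufficient here because matching an itemset as early as possible never prevents later itemsets from being matched.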

Definition 1. Given $\varepsilon \in (0, 1)$, an ε-approximation $C$ of $FSP(D, \theta)$ is defined as a set of pairs $(p, f_p)$:

$$C = \{(p, f_p) : p \in U, f_p \in (0, 1]\}$$

that has the following properties:
• $C$ contains a pair $(p, f_p)$ for every $(p, f_D(p)) \in FSP(D, \theta)$;
• $C$ contains no pair $(p, f_p)$ such that $f_D(p) < \theta - \varepsilon$;
• for every $(p, f_p) \in C$, it holds that $|f_D(p) - f_p| \le \varepsilon/2$.

(Note that while Reference [8] introduced the definition of ε-approximation of $FSP(D, \theta)$, it did not provide a sampling algorithm to find such an approximation for a given $\varepsilon \in (0, 1)$.) Intuitively, the approximation $C$ contains all the frequent sequential patterns that are in $FSP(D, \theta)$ (i.e., there are no false negatives) and no sequential pattern that has frequency in $D$ much below $\theta$. In addition, $C$ provides a good approximation of the actual frequency of each sequential pattern in $D$, within an error $\varepsilon/2$ that can be made arbitrarily small.
Depending on the application, one may be interested in a different approximation of $FSP(D, \theta)$, where all the sequential patterns in the approximation are frequent sequential patterns in the whole dataset.

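Definition 2. Given $\varepsilon \in (0, 1)$, a false positives free (FPF) ε-approximation $F$ of $FSP(D, \theta)$ is defined as a set of pairs $(p, f_p)$:

$$F = \{(p, f_p) : p \in U, f_p \in (0, 1]\}$$

that has the following properties:
• $F$ contains no pair $(p, f_p)$ such that $f_D(p) < \theta$;
• $F$ contains all the pairs $(p, f_p)$ such that $f_D(p) \ge \theta + \varepsilon$;
• for every $(p, f_p) \in F$, it holds that $|f_D(p) - f_p| \le \varepsilon/2$.

The approximation $F$ does not contain false positives, that is, sequences with $f_D(p) < \theta$. In addition, it does not miss sequences with $f_D(p) \ge \theta + \varepsilon$ and, similarly to the ε-approximation, for every pair in $F$ it gives a good approximation of the actual frequency of the sequential pattern in $D$, within an error $\varepsilon/2$ that can be made arbitrarily small.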
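To make the two notions of approximation concrete, the following Python sketch checks whether a candidate set of (pattern, estimated frequency) pairs satisfies them, given the exact frequencies. It assumes the three properties read as follows: for the ε-approximation, no frequent pattern is missing, no reported pattern has exact frequency below θ − ε, and every estimate is within ε/2 of the exact frequency; for the FPF ε-approximation, no reported pattern has exact frequency below θ, no pattern with exact frequency at least θ + ε is missing, and every estimate is within ε/2. The function names and the dictionary-based representation are hypothetical.

```python
from typing import Dict, Hashable


def is_eps_approximation(candidate: Dict[Hashable, float],
                         exact: Dict[Hashable, float],
                         theta: float, eps: float) -> bool:
    """Check the (assumed) properties of an eps-approximation.

    `exact` maps every pattern with non-zero frequency to its exact
    frequency; patterns absent from `exact` are treated as frequency 0.
    """
    no_false_negatives = all(p in candidate
                             for p, f in exact.items() if f >= theta)
    none_far_below = all(exact.get(p, 0.0) >= theta - eps for p in candidate)
    close = all(abs(exact.get(p, 0.0) - fp) <= eps / 2
                for p, fp in candidate.items())
    return no_false_negatives and none_far_below and close


def is_fpf_eps_approximation(candidate: Dict[Hashable, float],
                             exact: Dict[Hashable, float],
                             theta: float, eps: float) -> bool:
    """Check the (assumed) properties of an FPF eps-approximation."""
    no_false_positives = all(exact.get(p, 0.0) >= theta for p in candidate)
    covers_clear_cases = all(p in candidate
                             for p, f in exact.items() if f >= theta + eps)
    close = all(abs(exact.get(p, 0.0) - fp) <= eps / 2
                for p, fp in candidate.items())
    return no_false_positives and covers_clear_cases and close


# Hypothetical exact frequencies and two candidate outputs.
exact = {"a": 0.75, "b": 0.5, "c": 0.4, "d": 0.1}
C = {"a": 0.7, "b": 0.55, "c": 0.45}   # keeps a borderline pattern "c"
F = {"a": 0.7}                          # conservative output
print(is_eps_approximation(C, exact, theta=0.5, eps=0.2))
print(is_fpf_eps_approximation(F, exact, theta=0.5, eps=0.2))
```

In this toy instance, `C` may include the pattern "c" (frequency slightly below θ) but may not omit "b", whereas `F` may omit "b" but may not include "c".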

True Frequent Sequential Pattern Mining
In several applications, the dataset $D$ is a sample of transactions independently drawn from an unknown probability distribution $\pi$ on $U$. In such a scenario, the dataset $D$ is a finite bag of $|D|$ independent identically distributed (i.i.d.) samples from $\pi$. For any sequence $p \in U$, the real support set $T(p)$ of $p$ is the set of sequences in $U$ to which $p$ belongs: $T(p) = \{\tau \in U : p \sqsubseteq \tau\}$. We define the true frequency $t_\pi(p)$ of $p$ w.r.t. $\pi$ as the probability that a transaction sampled from $\pi$ contains $p$:

$$t_\pi(p) = \pi(T(p)).$$

In this scenario, the final goal of the data mining process on $D$ is to gain a better understanding of the process generating the data, that is, of the distribution $\pi$, through the true frequencies $t_\pi$, which are unknown and only approximately reflected in the dataset $D$. Therefore, we are interested in finding the sequential patterns with true frequency $t_\pi$ at least $\theta$ for some $\theta \in (0, 1]$. We call these sequential patterns the true frequent sequential patterns (TFSPs) and denote their set as:

$$TFSP(\pi, \theta) = \{(p, t_\pi(p)) : p \in U, t_\pi(p) \ge \theta\}.$$

Note that, given a finite number of random samples from $\pi$ (e.g., the dataset $D$), it is not possible to find the exact set $TFSP(\pi, \theta)$, and one has to resort to approximations of $TFSP(\pi, \theta)$. Analogously to the two approximations defined for the FSPs, we now define two approximations of the TFSPs, depending on the application we are interested in: the first has no false negatives, while the second has no false positives.

VC-Dimension
The Vapnik-Chervonenkis (VC) dimension [2,23] of a space of points is a measure of the complexity or expressiveness of a family of indicator functions, or, equivalently, of a family of subsets, defined on that space. A finite bound on the VC-dimension of a structure implies a bound on the number of random samples required to approximately learn that structure.
We define a range space as a pair $(X, R)$, where $X$ is a finite or infinite set and $R$, the range set, is a finite or infinite family of subsets of $X$. The members of $X$ are called points, while the members of $R$ are called ranges. Given $A \subseteq X$, we define the projection of $R$ in $A$ as $P_R(A) = \{r \cap A : r \in R\}$.
We define $2^A$ as the power set of $A$, that is, the set of all the possible subsets of $A$, including the empty set $\emptyset$ and $A$ itself. If $P_R(A) = 2^A$, then $A$ is said to be shattered by $R$. The VC-dimension $VC(X, R)$ of a range space is the cardinality of the largest set shattered by the space.
Example 2. Let $X = [0, 1]$ be the set of points and let $R$ be the family of all the closed intervals $[a, b]$ with $0 \le a \le b \le 1$. Let us consider the set $Y = \{x, y, z\}$, containing 3 points $0 \le x < y < z \le 1$. It is not possible to find a range whose intersection with the set $Y$ is $\{x, z\}$, since all the ranges $[a, b]$, with $0 \le a \le b \le 1$, containing $x$ and $z$ also contain $y$. Then, $VC(X, R)$ must be less than 3. Consider now the set $Y = \{x, y\}$, containing only 2 points $0 \le x < y \le 1$. It is easy to see that $Y$ is shattered by $R$, so $VC(X, R) = 2$.
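The shattering argument for closed intervals can be verified by brute force on small point sets. The sketch below is only an illustration under the assumption that the ranges are the closed intervals of [0, 1] and that the points are distinct; the function names are hypothetical, and `vc_dimension_on` only explores subsets of a finite pool of points rather than the whole space.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def interval_projection(points: Iterable[float]) -> Set[FrozenSet[float]]:
    """All subsets of `points` obtainable as an intersection with [a, b].

    An interval can only select a contiguous run of the sorted points;
    the empty set is obtained with a degenerate interval that contains
    none of the (finitely many) points.
    """
    pts = sorted(set(points))
    projection = {frozenset()}
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            projection.add(frozenset(pts[i:j + 1]))
    return projection


def is_shattered(points: Iterable[float]) -> bool:
    """True if every subset of `points` is in the projection."""
    pts = set(points)
    return len(interval_projection(pts)) == 2 ** len(pts)


def vc_dimension_on(pool: Iterable[float]) -> int:
    """Largest size of a subset of the finite pool that is shattered."""
    pts = sorted(set(pool))
    best = 0
    for k in range(1, len(pts) + 1):
        if any(is_shattered(sub) for sub in combinations(pts, k)):
            best = k
    return best


print(is_shattered([0.2, 0.7]))        # two points
print(is_shattered([0.1, 0.5, 0.9]))   # three points
print(vc_dimension_on([0.1, 0.3, 0.5, 0.9]))
```

For three points, the projection contains only 7 of the 8 subsets, because the subset made of the two extreme points cannot be selected without the middle one.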
The main application of VC-dimension in statistics and learning theory is to derive the sample size needed to approximately "learn" the ranges, as defined below.
Definition 6. Let $RS = (X, R)$ be a range space. Given $\varepsilon \in (0, 1)$, a bag $B$ of elements taken from $X$ is an ε-bag of $X$ if for all $r \in R$, we have

$$\left| \frac{|X \cap r|}{|X|} - \frac{|B \cap r|}{|B|} \right| \le \varepsilon.$$

Theorem 1. There is a constant $c > 0$ such that if $(X, R)$ is a range space of VC-dimension $\le d$, and $\varepsilon, \delta \in (0, 1)$, then a bag $B$ of $m$ elements taken with independent random extractions with replacement from $X$, where

$$m \ge \frac{c}{\varepsilon^2} \left( d + \ln \frac{1}{\delta} \right),$$

is an ε-bag of $X$ with probability $\ge 1 - \delta$.
The universal constant $c$ has been experimentally estimated to be at most 0.5 [24]. In the remainder of this work, we will use $c = 0.5$. Note that Theorem 1 also holds when $d$ is an upper bound to the empirical VC-dimension $EVC(RS, B)$ of $RS$ on $B$ [25]. In that case, the bag $B$ itself is an ε-bag of $X$.

Rademacher Complexity
The Rademacher complexity [3,23,26] is a tool to measure the complexity of a family of real-valued functions. Bounds based on the Rademacher complexity depend on the distribution of the dataset, differently from the ones based on VC-dimension that are distribution independent.
Let $D$ be a dataset of $n$ transactions $D = \{t_1, \ldots, t_n\}$. For each $i \in \{1, \ldots, n\}$, let $\sigma_i$ be an independent Rademacher random variable (r.v.) that takes value 1 or −1, each with probability 1/2. Let $G$ be a set of real-valued functions. The empirical Rademacher complexity $R_D$ on $D$ is defined as follows:

$$R_D = \mathbb{E}_\sigma \left[ \sup_{g \in G} \frac{1}{n} \sum_{i=1}^{n} \sigma_i g(t_i) \right],$$

where the expectation is taken w.r.t. the Rademacher r.v.'s $\sigma_i$. The Rademacher complexity is a measure of the expressiveness of the set $G$. A specific combination of $\sigma$'s represents a splitting of $D$ into two random sub-samples $D_1$ and $D_{-1}$. For a function $g \in G$, $\sum_{i=1}^{n} g(t_i)/n$ represents a good approximation of $\mathbb{E}[g]$ over $D$ if $n$ is sufficiently large. Instead, $\sum_{i=1}^{n} \sigma_i g(t_i)/n$ represents the difference between $\mathbb{E}[g]$ over the two random sub-samples $D_1$ and $D_{-1}$. By considering the expected value of the supremum of this difference over the set $G$, we get the empirical Rademacher complexity. Therefore, the intuition is that if $R_D$ is small, the dataset $D$ is sufficiently large to ensure a good estimate of $\mathbb{E}[g]$ for every $g \in G$. In this work, we study the Rademacher complexity of sequential patterns, which has not been explored before.
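For a small dataset and a finite family of functions, the empirical Rademacher complexity can be computed directly from its definition. The following Python sketch assumes the definition without absolute value, that is, the expectation over the signs of the supremum of the signed average, and a finite family given as a table `values[g][i]` holding the value of the g-th function on the i-th transaction. Both function names are hypothetical; the Monte Carlo variant is a generic estimator and not necessarily the approximation method used in this work.

```python
import itertools
import random
from typing import List, Sequence


def empirical_rademacher_exact(values: Sequence[Sequence[float]]) -> float:
    """Exact value by enumerating all 2^n sign vectors (tiny n only)."""
    n = len(values[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=n):
        # Supremum over the finite family of the signed average.
        total += max(sum(s * v for s, v in zip(sigma, row))
                     for row in values) / n
    return total / 2 ** n


def empirical_rademacher_mc(values: Sequence[Sequence[float]],
                            trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate obtained by sampling random sign vectors."""
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(trials):
        sigma: List[int] = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, row))
                     for row in values) / n
    return total / trials


# Two indicator functions on a dataset of n = 2 transactions:
# the first is 1 only on t_1, the second is 1 only on t_2.
table = [[1, 0], [0, 1]]
print(empirical_rademacher_exact(table))
print(empirical_rademacher_mc(table))
```

The exact enumeration grows as 2^n, which is why it is only practical as a sanity check on very small inputs.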

Maximum Deviation
Let $M$ be a probability distribution over a domain set $Z$. Let $F$ be a set of functions that go from $Z$ to $[-1, 1]$. Given a function $f \in F$, we define the expectation of $f$ as:

$$E(f) = \mathbb{E}_{z \sim M}[f(z)],$$

and, given a sample $Z$ of $n$ observations $z_1, \ldots, z_n$ drawn from $M$, the empirical average of $f$ on $Z$ as:

$$E(f, Z) = \frac{1}{n} \sum_{i=1}^{n} f(z_i).$$

The maximum deviation is defined as the largest difference between the expectation of a function $f$ and its empirical average on the sample $Z$:

$$\sup_{f \in F} |E(f) - E(f, Z)|.$$

We now use the maximum deviation to capture quantities of interest for the two mining tasks we consider in this work.
In the frequent pattern mining scenario, we aim to find good estimates for $f_D(p)$ for each pattern $p$. The frequency $f_D(p)$ is the expectation of a Bernoulli random variable (r.v.) $X_D(p, t)$ which is 1 if the pattern $p$ appears in a transaction $t$ drawn uniformly at random from $D$:

$$f_D(p) = \mathbb{E}[X_D(p, t)].$$

Let $S$ be a sample of transactions drawn uniformly and independently at random from $D$. We define the frequency $f_S(p)$ as the fraction of transactions of $S$ where $p$ appears. In this scenario, the frequency $f_D(p)$ of $p$ on $D$ and the frequency $f_S(p)$ of $p$ on the sample $S$ represent, respectively, the expectation $E(f_p)$ and the empirical average $E(f_p, S)$ of a function $f_p$ associated with a pattern $p$. Thus, the maximum deviation is:

$$\sup_{p \in U} |f_D(p) - f_S(p)|.$$

In the true frequent pattern mining scenario, we aim to find good estimates for $t_\pi(p)$ for each pattern $p$. Note that the true frequency $t_\pi(p)$ is the expectation of a Bernoulli r.v. which is 1 if the pattern $p$ appears in a transaction drawn from $\pi$. Moreover, it is easy to prove that the observed frequency $f_D(p)$ of a pattern $p$ in a dataset $D$ of transactions drawn from $\pi$ is an unbiased estimator for $t_\pi(p)$, that is:

$$\mathbb{E}[f_D(p)] = t_\pi(p).$$

Therefore, the true frequency $t_\pi(p)$ and the frequency $f_D(p)$ observed on the dataset $D$ represent, respectively, the expectation $E(f_p)$ and the empirical average $E(f_p, D)$ of a function $f_p$ associated with a pattern $p$. Thus, the maximum deviation is:

$$\sup_{p \in U} |t_\pi(p) - f_D(p)|.$$

In the next sections, we provide probabilistic upper bounds to the maximum deviation using the VC-dimension and the Rademacher complexity, which can therefore be used in both the frequent pattern mining and the true frequent pattern mining scenarios.

VC-Dimension of Sequential Patterns
In this section, we apply the statistical learning theory concept of VC-dimension to sequential patterns. First, we define the range space associated with a sequential dataset. Then, we show an efficiently computable upper bound on the VC-dimension and, finally, we present two applications of such an upper bound. The first one is to compute the size of a sample that guarantees a good approximation for the problem of mining the frequent sequential patterns. The second one is to compute an upper bound on the maximum deviation to mine the true frequent sequential patterns.
Remember that a range space is a pair $(X, R)$ where $X$ contains points and $R$ contains ranges. For a sequential dataset, $X$ is the dataset itself, while $R$ contains the sets of sequential transactions that are the support set of some sequential pattern.
Definition 7. Let $D$ be a sequential dataset consisting of sequential transactions and let $I$ be its ground set. Let $U$ be the set of all sequences built with itemsets containing items from $I$. We define $RS = (X, R)$ to be a range space associated with $D$ such that:
• $X = D$ is the set of sequential transactions in the dataset;
• $R = \{T_D(p) : p \in U\}$ is a family of sets of sequential transactions such that for each sequential pattern $p$, the set $T_D(p) = \{\tau \in D : p \sqsubseteq \tau\}$ is the support set of $p$ on $D$.
The VC-dimension of this range space is the maximum size of a set of sequential transactions that can be shattered by the support sets of the sequential patterns.
Example 3. Consider the following dataset $D = \{\tau_1, \tau_2, \tau_3, \tau_4\}$ as an example: [...] The dataset above has 4 transactions. We now show that the VC-dimension of the range space $RS$ associated with $D$ is 2. Let us consider the set $A$ [...] Since $|A| = 2$ and $A$ is shattered by $R$, the range space associated with $D$ has VC-dimension $\ge 2$. Analogously, the sets $\{\tau_1, \ldots$ [...] Thus, the VC-dimension of the range space associated with $D$ is exactly 2.
The exact computation of the (empirical) VC-dimension of the range space associated with a dataset $D$ is computationally expensive. The s-index, introduced by Servan-Schreiber et al. [8], provides an efficiently computable upper bound on the VC-dimension of sequential patterns. Such an upper bound is based on the notion of capacity $c(p)$ of a sequence $p$. The capacity $c(p)$ of a sequence $p$ is the number of distinct subsequences of $p$, that is, $c(p) = |\{z : z \sqsubseteq p\}|$. The exact capacity can be computed using the algorithm described in Reference [27], but it is computationally expensive and may be prohibitive for large datasets. Instead, Reference [8] proposed an algorithm to compute a more efficient upper bound $\tilde{c}(p) \ge c(p)$. A first simple bound is given by $2^{\|p\|} - 1$, which may be a loose upper bound of $c(p)$ because it is obtained by considering all the items contained in all the itemsets in $p$ as distinct, that is, the capacity of the sequence $p$ is $2^{\|p\|} - 1$ if and only if all the items contained in all the itemsets of the sequence $p$ are different. The bound proposed by Reference [8] can be computed as follows. When $p$ contains, among others, two itemsets $A$ and $B$ such that $A \subseteq B$, subsequences of the form $\langle C \rangle$ with $C \subseteq A$ are considered twice in $2^{\|p\|} - 1$, "generated" once from $A$ and once from $B$. To avoid over-counting such $2^{|A|} - 1$ subsequences, Reference [8] proposes to consider only the ones "generated" from the longest itemset that can generate them. Then, the s-index is defined as follows.
Definition 8. Let $D$ be a sequential dataset. The s-index of $D$ is the maximum integer $s$ such that $D$ contains at least $s$ different sequential transactions with upper bound to their capacities $\tilde{c}(p)$ at least $2^s - 1$, such that no one of them is a subsequence of another, that is, the $s$ sequential transactions form an anti-chain.
The following result from Reference [8] shows that the s-index is an upper bound to the VC-dimension of the range space for sequential patterns in D.
Theorem 2 (Lemma 3 [8]). Let D be a sequential dataset with s-index s. Then, the range space RS = (X, R) corresponding to D has VC-dimension ≤ s.
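The gap between the exact capacity and the naive bound based on the item-length can be observed on small sequences. The brute-force sketch below is an illustration only: it counts the distinct non-empty subsequences by explicit enumeration, which is exponential and is neither the algorithm of Reference [27] nor the refined bound of Reference [8]. It assumes that sequences are tuples of frozensets; all function names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Iterator, Set, Tuple

Itemset = FrozenSet[int]
Seq = Tuple[Itemset, ...]


def seq(*itemsets: Iterable[int]) -> Seq:
    """Build a sequence (tuple of frozensets) from plain Python sets."""
    return tuple(frozenset(s) for s in itemsets)


def nonempty_subsets(itemset: Itemset) -> Iterator[Itemset]:
    """Yield every non-empty subset of an itemset."""
    items = sorted(itemset)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def capacity(p: Seq) -> int:
    """Number of distinct non-empty subsequences of p (brute force)."""
    seen: Set[Seq] = {()}  # subsequences of the itemsets processed so far
    for itemset in p:
        subsets = list(nonempty_subsets(itemset))
        # Extend every known subsequence with a subset of this itemset.
        extended = {prefix + (s,) for prefix in seen for s in subsets}
        seen |= extended
    return len(seen) - 1  # do not count the empty sequence


def naive_capacity_bound(p: Seq) -> int:
    """The bound 2^(item-length) - 1, tight only if all items differ."""
    return 2 ** sum(len(itemset) for itemset in p) - 1


distinct_items = seq({1, 2}, {3})
repeated_items = seq({1}, {1, 2})
print(capacity(distinct_items), naive_capacity_bound(distinct_items))
print(capacity(repeated_items), naive_capacity_bound(repeated_items))
```

In the second sequence the itemset {1} is contained in {1, 2}, so the single-itemset subsequence made of {1} is generated twice by the naive count, which explains the difference between the two values.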
While an upper bound to the s-index can be computed in a streaming fashion, it still requires checking whether a transaction is a subsequence of the other transactions currently maintained in memory, which define the current value of the s-index. In addition, the computation of the upper bound $\tilde{c}(p)$ on the capacity of a sequence $p$ requires checking whether the itemsets of $p$ are subsets of each other. To avoid such expensive operations, we define an upper bound to the s-index, which we call the s-bound, that does not require checking whether the transactions form an anti-chain.
Definition 9. Let $D$ be a sequential dataset. The s-bound of $D$ is the maximum integer $s$ such that $D$ contains at least $s$ different sequential transactions with item-length at least $s$.
Algorithm 1 shows the pseudo-code to compute an upper bound to the s-bound in a streaming fashion. It uses an ordered set to maintain in memory the set of transactions that define the current value of the s-bound. The ordered set stores pairs composed of a transaction and its item-length, sorted by decreasing item-length. In addition, it uses a hash set to speed up the check for duplicate transactions.
In practice, it is quite uncommon that the long sequences that define the value of the s-index are subsequences of other sequences; thus, removing the anti-chain constraint does not noticeably loosen the bound. In addition, using the naive upper bound $2^{\|p\|} - 1$ on $c(p)$ is equivalent to considering the transactions that have item-length at least $s$ when computing the s-bound, which makes the computation much faster without worsening the bound on the VC-dimension in practice.
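The definition of the s-bound can be sketched in a few lines of Python. The version below is a simple batch computation that first removes duplicate transactions and then sorts the item-lengths; it is an assumption-laden illustration and not the streaming procedure of Algorithm 1, whose pseudo-code maintains an ordered set and a hash set. Sequences are assumed to be tuples of frozensets, and the function names and the toy dataset are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

Itemset = FrozenSet[int]
Seq = Tuple[Itemset, ...]


def seq(*itemsets: Iterable[int]) -> Seq:
    """Build a sequence (tuple of frozensets) from plain Python sets."""
    return tuple(frozenset(s) for s in itemsets)


def item_length(p: Seq) -> int:
    """Sum of the sizes of the itemsets of p."""
    return sum(len(itemset) for itemset in p)


def s_bound(dataset: List[Seq]) -> int:
    """Largest s such that at least s distinct transactions have
    item-length at least s (batch version, not streaming)."""
    distinct = set(dataset)  # equal transactions are counted once
    lengths = sorted((item_length(t) for t in distinct), reverse=True)
    s = 0
    for rank, length in enumerate(lengths, start=1):
        if length >= rank:
            s = rank
        else:
            break
    return s


# Hypothetical toy dataset with item-lengths 6, 3, 4 and 3.
D = [
    seq({1, 2}, {3}, {6, 7}, {5}),
    seq({6, 7}, {5}),
    seq({6, 7}, {1}, {5}),
    seq({5}, {6, 7}),
]
print(s_bound(D))
```

Sorting the distinct item-lengths in decreasing order reduces the problem to finding the last rank whose item-length is at least the rank itself, in the same spirit as the computation of an h-index.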

Compute the Sample Size for Frequent Sequential Pattern Mining
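In this section, we show how to compute a sample size $m$ for a random sample $S$ of transactions taken from $D$ such that the maximum deviation is bounded by $\varepsilon/2$, that is, $\sup_{p \in U} |f_D(p) - f_S(p)| \le \varepsilon/2$, for a user-defined value $\varepsilon$, using the upper bound on the VC-dimension defined above. This result underlies the sampling algorithm that will be introduced in Section 5. Algorithm 2 shows how to compute a sample size that guarantees that $\sup_{p \in U} |f_D(p) - f_S(p)| \le \varepsilon/2$ with probability at least $1 - \delta$. This algorithm is used in the sampling algorithm (Section 5).

Theorem 3 (Proof in Appendix A). Let $S$ be a random sample of $m$ transactions taken with replacement from the sequential dataset $D$ and $\varepsilon, \delta \in (0, 1)$. Let $d$ be the s-bound of $D$. If

$$m \ge \frac{2}{\varepsilon^2} \left( d + \ln \frac{1}{\delta} \right),$$

then $\sup_{p \in U} |f_D(p) - f_S(p)| \le \varepsilon/2$ with probability at least $1 - \delta$.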
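The sample size can be computed directly once an upper bound d on the VC-dimension is available. The sketch below assumes the bound m ≥ (2/ε²)(d + ln(1/δ)), which is what the general sample-size theorem gives with c = 0.5 when the target deviation is ε/2; the function name is hypothetical and the code is not a transcription of Algorithm 2.

```python
import math


def sample_size(d: int, eps: float, delta: float) -> int:
    """Smallest integer m with m >= (2 / eps^2) * (d + ln(1 / delta)).

    d     : upper bound on the VC-dimension (e.g., the s-bound)
    eps   : accuracy parameter; the deviation is bounded by eps / 2
    delta : confidence parameter; the guarantee holds w.p. >= 1 - delta
    """
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eps and delta must lie in (0, 1)")
    if d < 0:
        raise ValueError("d must be non-negative")
    return math.ceil((2.0 / eps ** 2) * (d + math.log(1.0 / delta)))


m = sample_size(d=10, eps=0.1, delta=0.1)
print(m)
```

Note that the sample size depends on the dataset only through d, so it does not grow with the number of transactions: halving ε roughly quadruples m, while decreasing δ has only a logarithmic effect.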

Compute an Upper Bound to the Max Deviation for the True Frequent Sequential Patterns
In this section, we show how to compute an upper bound $\mu_{VC}/2$ on the maximum deviation for the true frequent sequential pattern mining problem, that is, $\sup_{p \in U} |t_\pi(p) - f_D(p)| \le \mu_{VC}/2$, using the upper bound on the empirical VC-dimension. This result underlies the strategy for mining the true frequent sequential patterns that will be introduced in Section 6.
We define a range space associated with the generative process $\pi$ as a range space where the set of points is $X = U$ and the range set is $R = \{T(p) : p \in U\}$. The s-bound of the dataset $D$, as defined above, is an upper bound on the empirical VC-dimension of the range space associated with $\pi$ computed on $D$. Algorithm 3 shows how to compute an upper bound on the maximum deviation that is used in the true frequent sequential pattern mining algorithm (Section 6).

Theorem 4 (Proof in Appendix A). Let $D$ be a finite bag of $|D|$ i.i.d. samples from an unknown probability distribution $\pi$ on $U$ and $\delta \in (0, 1)$. Let $d$ be the s-bound of $D$. If

$$\mu_{VC} = \sqrt{\frac{2}{|D|} \left( d + \ln \frac{1}{\delta} \right)},$$

then $\sup_{p \in U} |t_\pi(p) - f_D(p)| \le \mu_{VC}/2$ with probability at least $1 - \delta$.
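When the dataset size is fixed, the sample-size bound can be inverted to obtain the accuracy that the available data supports. The sketch below assumes the relation |D| = (2/µ²)(d + ln(1/δ)), obtained from the general sample-size theorem with c = 0.5 and deviation µ/2, and solves it for µ; the function name is hypothetical and the code is not a transcription of Algorithm 3.

```python
import math


def mu_vc(n: int, d: int, delta: float) -> float:
    """Value mu such that n = (2 / mu^2) * (d + ln(1 / delta)).

    n     : number of transactions in the dataset
    d     : upper bound on the empirical VC-dimension (e.g., the s-bound)
    delta : confidence parameter
    Under the stated assumption, the maximum deviation between true and
    observed frequencies is at most mu / 2 with probability >= 1 - delta.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(2.0 * (d + math.log(1.0 / delta)) / n)


mu = mu_vc(n=2461, d=10, delta=0.1)
print(mu, mu / 2)  # accuracy parameter and bound on the deviation
```

As expected from the square root, quadrupling the number of transactions only halves the bound on the deviation.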

Rademacher Complexity of Sequential Patterns
In this section we introduce the Rademacher complexity of sequential patterns. We propose a method for finding an efficiently computable upper bound on the empirical Rademacher complexity $R_D$ of sequential patterns (similar to what has been done in Reference [5] for itemsets) and a method for approximating it. In the true frequent pattern mining scenario, these results are used to define a quantity that is, with high probability, an upper bound on the maximum deviation $\sup_{p \in U} |t_\pi(p) - f_D(p)|$.
The introduction of the Rademacher complexity of sequential patterns requires the definition of a set of real-valued functions. We define, for each pattern $p \in U$, the indicator function $\phi_p : U \to \{0, 1\}$ as

$$\phi_p(t) = \begin{cases} 1 & \text{if } p \text{ appears in } t, \\ 0 & \text{otherwise,} \end{cases}$$

where $t$ is a transaction. That is, given a transaction $t$ of a dataset $D$ with $n$ transactions, $\phi_p(t)$ is 1 if $p$ appears in $t$, and 0 otherwise. We define the set of real-valued functions as the family of these indicator functions.
The frequency of $p$ in $D$ can be defined using the indicator function $\phi_p$: $f_D(p) = \sum_{t \in D} \phi_p(t)/n$. The (empirical) Rademacher complexity $R_D$ on a given dataset $D = \{t_1, \dots, t_n\}$ is defined as

$$R_D = \mathbb{E}_\sigma\left[\sup_{p \in U} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, \phi_p(t_i)\right],$$

where $\sigma_1, \dots, \sigma_n$ are independent Rademacher r.v. (each takes value 1 or −1 with probability 1/2) and the expectation is taken w.r.t. the $\sigma_i$, that is, conditionally on the dataset $D$. The connection between the Rademacher complexity of sequential patterns and the maximum deviation is given by the following theorem, which derives from standard results in statistical learning theory (Thm. 3.2 in Reference [3]).

Theorem 5.
With probability at least 1 − δ:

$$\sup_{p \in U} |t_\pi(p) - f_D(p)| \le 2 R_D + \sqrt{\frac{2 \ln(2/\delta)}{n}} = \frac{\mu_R}{2}.$$

The naïve computation of the exact value of $R_D$ is expensive, since it requires mining all patterns from $D$ and generating all $2^n$ possible combinations of values of the Rademacher variables to compute the expectation. In the next sections we present an efficiently computable upper bound on the Rademacher complexity of sequential patterns and a simple method that approximates it, which are useful to find, respectively, an upper bound on and an approximation of $\mu_R/2$.
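To make the cost of the naïve computation concrete, the following Python sketch evaluates the empirical Rademacher complexity exactly on a toy dataset by enumerating all $2^n$ sign vectors. It is only an illustration under stated assumptions: transactions and patterns are represented as tuples of itemsets, the supremum is restricted to a user-supplied list of candidate patterns plus the all-zero vector (which accounts for patterns that never appear), and the names `contains` and `exact_rademacher` are hypothetical.

```python
from itertools import product


def contains(pattern, transaction):
    """True if `pattern` is a subsequence of `transaction`.

    Both are sequences of itemsets; each itemset of the pattern must be a
    subset of a distinct itemset of the transaction, in the same order.
    Greedy left-to-right matching is sufficient for this test.
    """
    pos = 0
    for itemset in pattern:
        while pos < len(transaction) and not set(itemset) <= set(transaction[pos]):
            pos += 1
        if pos == len(transaction):
            return False
        pos += 1
    return True


def exact_rademacher(patterns, dataset):
    """Average, over all 2^n sign vectors, of the sup over patterns of
    (1/n) * sum_i sigma_i * phi_p(t_i). Exponential in n: toy inputs only."""
    n = len(dataset)
    phi = [[1 if contains(p, t) else 0 for t in dataset] for p in patterns]
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        # The 0 accounts for patterns of the universe that never appear in the dataset.
        best = max([0] + [sum(s * x for s, x in zip(sigma, row)) for row in phi])
        total += best / n
    return total / 2 ** n
```

Even for a dataset of 30 transactions this loop would need about a billion iterations, which is why an efficiently computable bound or an approximation is needed in practice.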

An Efficiently Computable Upper Bound to the Rademacher Complexity of Sequential Patterns
For any pattern $p \in U$, let us define the following $|D|$-dimensional vector

$$v_D(p) = \left(\phi_p(t_1), \dots, \phi_p(t_{|D|})\right),$$

and let $V_D = \{v_D(p) : p \in U\}$. Note that all the infinitely many sequences of the universe $U$ that do not appear in $D$ are associated with the vector (0, . . . , 0) of $|D|$ zeros. This implies that $V_D$ is finite: $|V_D| < \infty$. In addition, defining $|U(D)|$ as the number of sequential patterns that appear in $D$, we have $|V_D| \le |U(D)|$, and the inequality may be strict, since two or more patterns may be associated with the same vector $v_D \in V_D$ (i.e., these patterns appear in exactly the same transactions).
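The relation between patterns and binary vectors can be illustrated with a short Python sketch that groups a list of candidate patterns by the vector of transactions in which they appear. The representation (tuples of itemsets) and the names `contains` and `binary_vectors` are assumptions made for this example only.

```python
from collections import defaultdict


def contains(pattern, transaction):
    """True if `pattern` (a sequence of itemsets) is a subsequence of `transaction`."""
    pos = 0
    for itemset in pattern:
        while pos < len(transaction) and not set(itemset) <= set(transaction[pos]):
            pos += 1
        if pos == len(transaction):
            return False
        pos += 1
    return True


def binary_vectors(patterns, dataset):
    """Map each distinct |D|-dimensional 0/1 vector to the patterns that share it."""
    groups = defaultdict(list)
    for p in patterns:
        vector = tuple(int(contains(p, t)) for t in dataset)
        groups[vector].append(p)
    return dict(groups)
```

On a dataset with two copies of the transaction ⟨{a}, {b}⟩ and one transaction ⟨{c}⟩, the patterns ⟨{a}⟩, ⟨{b}⟩ and ⟨{a}, {b}⟩ all map to the vector (1, 1, 0), so three patterns contribute a single element of the set of vectors.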
The following two theorems derive from known results of statistical learning theory (Thm. 3.3 of Reference [3]). Both theorems have been used for mining frequent itemsets [5], and can be applied for sequential pattern mining.
The following theorem is a stronger version of the previous one.
The upper bound on $R_D$ of Theorem 7 is not directly applicable to sequential pattern mining, since it requires mining every pattern that appears in $D$ in order to determine the entire set $V_D$. However, the set $V_D$ is related to the set of closed sequential patterns on $D$. The following two results give us an upper bound on the size of $V_D$ that depends on the number of closed sequential patterns of $D$.

A simple example where $C = 2$ is depicted in Figure 1. Note first of all that each super-sequence of $x_1$ but not of $x_2$ has support lower than the support of $x_1$, and each super-sequence of $x_2$ but not of $x_1$ has support lower than the support of $x_2$. Let $y_\tau = \tau_{x_1, x_2}$ be the subsequence of transaction τ restricted to only the sequences $x_1$ and $x_2$, preserving the relative order of their itemsets. Then $y_{\tau_1} = y_{\tau_3} \ne y_{\tau_2}$, which implies that $|T_W(y_{\tau_1})|$, $|T_W(y_{\tau_2})|$, and $|T_W(y_{\tau_3})|$ are lower than $|T_W(x_1)| = |T_W(x_2)| = |W|$. Therefore each super-sequence of both $x_1$ and $x_2$ has support lower than the support of $x_1$ (which is equal to that of $x_2$). Thus, $x_1$ and $x_2$ are closed sequences in $D$ with the same support set $W$.

Note that the previous lemma is the sequential-pattern version of Lemma 3 of Reference [5] for itemsets, where the upper bound on the number of closed itemsets in $D$ with $W$ as support set is one (this holds by the nature of itemsets, for which the notion of "ordering" is not defined). Lemma 1 is crucial for proving the following lemma, which provides a bound on the size of the set $V_D$ of binary vectors.

Combining a partitioning of $CS(D)$ with the previous lemma, we can define a function $\tilde{w}$, an upper bound on the function $w$ of Theorem 7, which is efficient to compute with a single scan of $D$. Let $I$ be the set of items that appear in the dataset $D$ and let $<_o$ be the increasing ordering of its items by their support in $D$ (ties broken arbitrarily). Given an item $a$, let $T_D(\langle\{a\}\rangle)$ be its support set on $D$.
Let $<_a$ denote the increasing ordering of the transactions in $T_D(\langle\{a\}\rangle)$ by the number of items they contain that come after $a$ w.r.t. the ordering $<_o$ (ties broken arbitrarily). Let $CS(D) = C_1 \cup C_{2+}$, where $C_1 = \{p \in CS(D) : \|p\| = 1\}$ and $C_{2+} = \{p \in CS(D) : \|p\| \ge 2\}$. Let us focus on partitioning $C_{2+}$. Let $p \in C_{2+}$ and let $a$ be the item in $p$ that comes before any other item in $p$ w.r.t. the order $<_o$. Let τ be the transaction containing $p$ that comes before any other transaction containing $p$ w.r.t. the order $<_a$. We assign $p$ to the set $C_{a,\tau}$.

Remember that an item can appear multiple times in a sequence. Given a transaction $\tau \in T_D(\langle\{a\}\rangle)$, $k_{a,\tau}$ is the number of items in τ (counted with their multiplicity) that are equal to $a$ or that come after $a$ in $<_o$. Let $m_{a,\tau}$ be the multiplicity of $a$ in τ. For each $k, m \ge 1$ with $m \le k$, let $g_{a,k,m}$ be the number of transactions in $T_D(\langle\{a\}\rangle)$ that contain exactly $k$ items (counted with their multiplicity) equal to $a$ or located after $a$ in the ordering $<_o$, with exactly $m$ repetitions of $a$. Let $\chi_a = \max\{k : g_{a,k,m} > 0\}$. The following lemma gives us an upper bound on the size of $C_{a,\tau}$.

Lemma 3 (Proof in Appendix A). We have $|C_{a,\tau}| \le 2^{k_{a,\tau} - m_{a,\tau}} \left(2^{m_{a,\tau}} - 1\right)$.
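The quantities $k_{a,\tau}$ and $m_{a,\tau}$ and the resulting bound $2^{k_{a,\tau} - m_{a,\tau}}(2^{m_{a,\tau}} - 1)$ can be computed as in the following Python sketch. It is an illustration only: transactions are assumed to be tuples of itemsets, ties in the support ordering are broken by item name (one arbitrary choice among many), and the function names are hypothetical.

```python
from collections import Counter


def item_order(dataset):
    """Rank items by increasing support (number of transactions containing them).

    Ties are broken by item name, which is an assumption of this sketch.
    """
    support = Counter()
    for transaction in dataset:
        support.update({item for itemset in transaction for item in itemset})
    ranked = sorted(support, key=lambda item: (support[item], item))
    return {item: rank for rank, item in enumerate(ranked)}


def lemma3_bound(transaction, a, rank):
    """Return (k, m, bound) for an item `a` that occurs in `transaction`.

    m     -- multiplicity of `a` in the transaction
    k     -- number of items (with multiplicity) equal to `a` or after `a` in the order
    bound -- 2^(k - m) * (2^m - 1)
    """
    items = [item for itemset in transaction for item in itemset]
    m = items.count(a)
    k = sum(1 for item in items if rank[item] >= rank[a])
    return k, m, 2 ** (k - m) * (2 ** m - 1)
```

The bound counts the sub-multisets of the $k_{a,\tau}$ eligible items that include at least one of the $m_{a,\tau}$ occurrences of $a$.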
Combining the following partitioning of $CS(D)$,

$$CS(D) = C_1 \cup \bigcup_{a \in I} \; \bigcup_{\tau \in T_D(\langle\{a\}\rangle)} C_{a,\tau},$$

with the previous lemma, and observing that $|C_1| \le |I|$, we obtain

$$|CS(D)| \le |I| + \sum_{a \in I} \; \sum_{\tau \in T_D(\langle\{a\}\rangle)} 2^{k_{a,\tau} - m_{a,\tau}} \left(2^{m_{a,\tau}} - 1\right).$$

Now we are ready to define the function $\tilde{w}$, which can be used to obtain an efficiently computable upper bound on $R_D$. The following lemma is the analogue of Lemma 5 of Reference [5], adjusted for sequential patterns. Let $\bar{\eta}$ be the average item-length of the transactions of $D$, that is, $\bar{\eta} = \sum_{t \in D} \|t\|/n$. Let $\hat{\eta}$ be the maximum item-length of the transactions of $D$, that is, $\hat{\eta} = \max_{t \in D} \|t\|$. Let η be an item-length threshold, with $\bar{\eta} < \eta \le \hat{\eta}$. Let $D(\eta)$ be the bag of transactions of $D$ with item-length greater than η. Let $V_{D(\eta)}$ be the set of the $2^{|D(\eta)|} - 1$ binary vectors associated with all possible non-empty sub-bags of $D(\eta)$.

Lemma 4 (Proof in Appendix A). Given an item $a$ in $I$, we define the following quantity:

Let $\tilde{w} : \mathbb{R}^+ \to \mathbb{R}^+$ be the function

Then,

For a given value of η, the function $\tilde{w}$ can be computed with a single scan of the dataset, since it only requires knowing $g_{a,k,m}$ for each $a \in I$ and for each $k, m$ with $1 \le k \le \chi_a$ and $1 \le m \le k$. The values $\bar{\eta}$ and $\hat{\eta}$, the support of each item, and consequently the ordering $<_o$, are obtained during the dataset creation. Thus, it is sufficient to look at each transaction τ, sort the items $I_\tau$ that appear in τ according to $<_o$, and, for each item $a$ of $I_\tau$, keep track of its multiplicity $m_{a,\tau}$, compute $k_{a,\tau}$, and increase $g_{a,k_{a,\tau},m_{a,\tau}}$ by one. Finally, since $\tilde{w}$ is convex and has first and second derivatives w.r.t. $s$ everywhere in $\mathbb{R}^+$, its global minimum can be computed using a non-linear optimization solver.

This procedure has to be repeated for each possible value of η in $(\bar{\eta}, \hat{\eta}]$. However, instead of taking into account each possible value, one could test a particular schedule of values of η and still achieve a value of $\tilde{w}$ near its minimum. A possible choice is to look at the restricted interval $[\bar{\eta} + \beta_1, \min(\beta_2, \hat{\eta})]$, given two positive values $\beta_1$ and $\beta_2$, instead of investigating the whole interval $(\bar{\eta}, \hat{\eta}]$. This choice is motivated by the fact that in Lemma 4 the value of η indicates which term of the summation is dominant (the one based on closed sequential patterns or the one based on binary vectors). If η is close to $\bar{\eta}$, then the number of binary vectors we count could be high, the dominant term is the one based on the set of binary vectors, and we expect the upper bound to be high. If instead η is close to $\hat{\eta}$, then the upper bound on the number of closed sequential patterns we count could be high, and the set of binary vectors we take into account is small. In this case, the dominant term is the one based on the closed sequential patterns, and the value of the upper bound could be high (since we count many sequential patterns with item-length greater than $\bar{\eta}$ that would instead be associated with a small number of binary vectors). Thus, the best value of η is far enough from both $\bar{\eta}$ and $\hat{\eta}$ that neither too many closed sequential patterns nor too many binary vectors are counted.
Finally, we define ComputeMaxDevRadeBound as the procedure for computing an upper bound on $\mu_R/2$ where, once the upper bound $R^b_D$ on the Rademacher complexity $R_D$ is computed using Algorithm 4, the upper bound $\mu^b_R/2$ on $\mu_R/2$ is obtained by

$$\frac{\mu^b_R}{2} = 2 R^b_D + \sqrt{\frac{2 \ln(2/\delta)}{n}}. \qquad (34)$$

The pseudo-code of the algorithm for computing the upper bound on $R_D$ follows.

        g_{a, k_{a,τ}, m_{a,τ}} += 1;
        χ_a ← max(χ_a, k_{a,τ});
    return min_{s ∈ R+, η̄ < η ≤ η̂} w̃(s, η);
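The counting step of this procedure can be sketched in Python as follows. The sketch only collects the counters $g_{a,k,m}$ and the values $\chi_a$; it does not evaluate or minimize $\tilde{w}$. It makes a first pass to compute item supports, whereas the text above assumes these are already available from dataset creation. The function name `scan_counts`, the tuple-of-itemsets representation and the tie-breaking by item name are assumptions of this example.

```python
from collections import Counter, defaultdict


def scan_counts(dataset):
    """Collect g[(a, k, m)] and chi[a] for every item a.

    g[(a, k, m)] -- number of transactions containing a with exactly k items
                    (with multiplicity) equal to a or after a in the support
                    order, and exactly m occurrences of a
    chi[a]       -- largest k observed for item a
    """
    # Pass 1: item supports and the increasing-support order (ties by name).
    support = Counter()
    for transaction in dataset:
        support.update({item for itemset in transaction for item in itemset})
    ranked = sorted(support, key=lambda item: (support[item], item))
    rank = {item: r for r, item in enumerate(ranked)}

    # Pass 2: one look at each transaction to update the counters.
    g = defaultdict(int)
    chi = defaultdict(int)
    for transaction in dataset:
        items = [item for itemset in transaction for item in itemset]
        multiplicity = Counter(items)
        for a, m in multiplicity.items():
            k = sum(1 for item in items if rank[item] >= rank[a])
            g[(a, k, m)] += 1
            chi[a] = max(chi[a], k)
    return dict(g), dict(chi)
```

The inner sum is quadratic in the transaction length for clarity; sorting the items of each transaction by rank once, as described in the text, would let all values of $k$ be read off in a single pass over the sorted list.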

Approximating the Rademacher Complexity of Sequential Patterns
The previous section presents an efficiently computable upper bound on the Rademacher complexity of sequential patterns, which does not require extracting any frequent sequences from the given dataset. Here we present a simple method that gives an approximation of the Rademacher complexity of sequential patterns, which provides a tighter bound on the maximum deviation than the ones previously presented.
In the definition of the Rademacher complexity, a given combination $\bar{\sigma}$ of values of the Rademacher r.v. σ splits the dataset $D$ of $n$ transactions into two sub-samples $D_1(\bar{\sigma})$ and $D_{-1}(\bar{\sigma})$: each transaction associated with 1 goes into $D_1(\bar{\sigma})$ and each transaction associated with −1 goes into $D_{-1}(\bar{\sigma})$. For a given sequential pattern $p \in U$, let $Supp_{D_1(\bar{\sigma})}(p)$ and $Supp_{D_{-1}(\bar{\sigma})}(p)$ be the number of transactions of $D_1(\bar{\sigma})$ and $D_{-1}(\bar{\sigma})$, respectively, in which $p$ appears. Thus, the Rademacher complexity can be rewritten as follows:

$$R_D = \mathbb{E}_\sigma\left[\sup_{p \in U} \frac{Supp_{D_1(\sigma)}(p) - Supp_{D_{-1}(\sigma)}(p)}{n}\right].$$

In our approximation method we generate a single combination $\bar{\sigma}$ of the Rademacher r.v. σ, instead of generating every possible combination and then taking the expectation. Given $\bar{\sigma}$, and writing $\gamma(p) = Supp_{D_1(\bar{\sigma})}(p) - Supp_{D_{-1}(\bar{\sigma})}(p)$, the approximation is

$$\tilde{R}_D(\bar{\sigma}) = \sup_{p \in U} \frac{\gamma(p)}{n}.$$

The first step of the procedure is to mine frequent sequential patterns from $D_1(\bar{\sigma})$ and $D_{-1}(\bar{\sigma})$, given a frequency threshold κ. Let $FSP(D_1(\bar{\sigma}), \kappa)$ and $FSP(D_{-1}(\bar{\sigma}), \kappa)$ be the sets of sequential patterns with support greater than or equal to κ in $D_1(\bar{\sigma})$ and $D_{-1}(\bar{\sigma})$, respectively. Let us define the following quantities:

and

If $\max(\gamma_1, \gamma_2)/n \ge \kappa$ then $\tilde{R}_D(\bar{\sigma}) = \max(\gamma_1, \gamma_2)/n$, since each pattern $p$ that is not frequent in both sub-samples has $\gamma(p)/n$ lower than κ. If instead $\max(\gamma_1, \gamma_2)/n < \kappa$, the entire procedure is repeated with $\kappa = \max(\gamma_1, \gamma_2)/n$. Note that, since the Rademacher complexity is a non-negative quantity, it is not necessary to look at patterns in $FSP(D_{-1}(\bar{\sigma}), \kappa) \setminus FSP(D_1(\bar{\sigma}), \kappa)$, since their values of $\gamma(p)$ are negative. The pseudo-code of the method for finding an approximation of $R_D$ is presented in Algorithm 5. The extraction of frequent sequences from the two sub-samples can be done using one of the many algorithms for mining frequent sequential patterns.

Finally, we define ComputeMaxDevRadeApprox as the procedure for computing an approximation of $\mu_R/2$ where, once the approximation $R^a_D$ of the Rademacher complexity $R_D$ is computed using Algorithm 5, the approximation $\mu^a_R/2$ of $\mu_R/2$ is obtained by:

$$\frac{\mu^a_R}{2} = 2 R^a_D + \sqrt{\frac{2 \ln(2/\delta)}{n}}. \qquad (40)$$
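The idea of replacing the expectation with a single draw of the Rademacher variables can be sketched in Python as follows. This is a simplified illustration, not Algorithm 5: instead of mining the two sub-samples with a threshold κ that is lowered when needed, it scans a user-supplied list of candidate patterns and takes the largest value of (support in the +1 sub-sample minus support in the −1 sub-sample)/n, floored at zero. The names `draw_sigma` and `approx_rademacher` and the tuple-of-itemsets representation are assumptions of this example.

```python
import random


def contains(pattern, transaction):
    """True if `pattern` (a sequence of itemsets) is a subsequence of `transaction`."""
    pos = 0
    for itemset in pattern:
        while pos < len(transaction) and not set(itemset) <= set(transaction[pos]):
            pos += 1
        if pos == len(transaction):
            return False
        pos += 1
    return True


def draw_sigma(n, seed=None):
    """One realization of n independent Rademacher variables."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(n)]


def approx_rademacher(patterns, dataset, sigma):
    """max over candidate patterns of (Supp_{+1}(p) - Supp_{-1}(p)) / n, at least 0."""
    n = len(dataset)
    best = 0
    for p in patterns:
        # Summing the signs of the transactions containing p gives
        # (support in the +1 sub-sample) - (support in the -1 sub-sample).
        gamma = sum(s for s, t in zip(sigma, dataset) if contains(p, t))
        best = max(best, gamma)
    return best / n
```

A call such as `approx_rademacher(candidates, dataset, draw_sigma(len(dataset), seed=0))` evaluates one draw; the result is exact for that draw only if the candidate list contains a pattern attaining the supremum.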

Sampling-Based Algorithm for Frequent Sequential Pattern Mining
We now present a sampling algorithm for frequent sequential pattern mining. The aim of this algorithm is to reduce the amount of data that must be considered to mine the frequent sequential patterns, in order to speed up their extraction and to reduce the amount of memory required. We define a random sample as a bag of $m$ transactions taken uniformly and independently at random, with replacement, from $D$. Obtaining the exact set FSP(D, θ) from a random sample is not possible, so we focus on obtaining an ε-approximation with probability at least 1 − δ, where δ ∈ (0, 1) is a confidence parameter whose value, together with ε, is provided as input by the user. Intuitively, if a random sample is sufficiently large, then the set of frequent sequential patterns extracted from the random sample approximates the set FSP(D, θ) well. The challenge is to find the number of transactions that are necessary to obtain the desired ε-approximation. To compute such a sample size, our approach uses the VC-dimension of sequential patterns (see Section 3.1).

Theorem 8. Given ε, δ ∈ (0, 1), let $S$ be a random sample of $m$ sequential transactions taken independently at random with replacement from the dataset $D$ such that $\sup_{p \in U} |f_D(p) - f_S(p)| \le \varepsilon/2$ with probability at least 1 − δ. Then, given θ ∈ (0, 1], the set FSP(S, θ − ε/2) is an ε-approximation to FSP(D, θ) with probability at least 1 − δ.
Proof. Suppose that $\sup_{p \in U} |f_D(p) - f_S(p)| \le \varepsilon/2$. In such a scenario, we have that for all sequential patterns $p \in U$, $f_S(p) \in [f_D(p) - \varepsilon/2, f_D(p) + \varepsilon/2]$. This also holds for the sequential patterns in $C$ = FSP(S, θ − ε/2). Therefore, the set $C$ satisfies Property 3 from Definition 1. It also means that for all $p$ ∈ FSP(D, θ), $f_S(p) \ge \theta - \varepsilon/2$, so every such $p$ is in $C$, and $C$ also satisfies Property 1. Now, let $p^*$ be a sequential pattern such that $f_D(p^*) < \theta - \varepsilon$. Then $f_S(p^*) < \theta - \varepsilon/2$, that is, $p^* \notin C$, which allows us to conclude that $C$ also has Property 2 from Definition 1. Since we know that $\sup_{p \in U} |f_D(p) - f_S(p)| \le \varepsilon/2$ with probability at least 1 − δ, the set $C$ is an ε-approximation to FSP(D, θ) with probability at least 1 − δ, which concludes the proof.
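The three properties used in the proof can be checked mechanically once the exact frequencies are known. The Python sketch below reads the properties off the proof above (all patterns with frequency at least θ are present; no pattern with frequency below θ − ε is present; every reported frequency is within ε/2 of the exact one). Definition 1 itself is not restated here, so this reading, like the function name and the dictionary representation, is an assumption of the example.

```python
def is_eps_approximation(freq_full, output, theta, eps):
    """Check the three properties of an eps-approximation.

    freq_full -- dict: pattern -> exact frequency in the full dataset
                 (patterns missing from the dict are treated as frequency 0)
    output    -- dict: pattern -> frequency measured in the sample
    """
    # Property 1: every pattern that is frequent in the full dataset is reported.
    has_all_frequent = all(p in output for p, f in freq_full.items() if f >= theta)
    # Property 2: no reported pattern has exact frequency below theta - eps.
    no_far_infrequent = all(freq_full.get(p, 0.0) >= theta - eps for p in output)
    # Property 3: reported frequencies are within eps / 2 of the exact ones.
    accurate = all(abs(freq_full.get(p, 0.0) - f) <= eps / 2 for p, f in output.items())
    return has_all_frequent and no_far_infrequent and accurate
```

Such a check requires mining the full dataset, so it is meaningful for evaluation purposes only, not as part of the sampling algorithm.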
Theorem 8 provides a simple sampling-based algorithm to obtain an ε-approximation to FSP(D, θ) with probability at least 1 − δ: take a random sample of $m$ transactions from $D$ such that the maximum deviation is bounded by ε/2, that is, $\sup_{p \in U} |f_D(p) - f_S(p)| \le \varepsilon/2$, and output the set FSP(S, θ − ε/2). As illustrated in Section 3.1, such a sample size can be computed using an efficient upper bound on the VC-dimension, given as input the desired upper bound ε/2 on the maximum deviation (see Algorithm 2). Note that such a sample size cannot be computed with the Rademacher complexity, since the sample size appears in both terms of the right-hand side of Equation (23). Thus, it is not possible to fix the value of the bound on the maximum deviation and solve for the sample size that provides such guarantees. Algorithm 6 shows the pseudo-code of the sampling algorithm.
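We now provide the corresponding theorem to find a FPF ε-approximation.

Theorem 9. Given ε, δ ∈ (0, 1), let $S$ be a random sample of $m$ sequential transactions taken independently at random with replacement from the dataset $D$ such that $\sup_{p \in U} |f_D(p) - f_S(p)| \le \varepsilon/2$ with probability at least 1 − δ. Then, given θ ∈ (0, 1], the set FSP(S, θ + ε/2) is a FPF ε-approximation to FSP(D, θ) with probability at least 1 − δ.

Proof. Suppose that $\sup_{p \in U} |f_D(p) - f_S(p)| \le \varepsilon/2$. In such a scenario, we have that for all sequential patterns $p \in U$, $f_S(p) \in [f_D(p) - \varepsilon/2, f_D(p) + \varepsilon/2]$. This also holds for the sequential patterns in $F$ = FSP(S, θ + ε/2). Therefore, the set $F$ satisfies Property 3 from Definition 2. It also means that for all $p^* \notin$ FSP(D, θ), $f_S(p^*) < \theta + \varepsilon/2$, so every such $p^*$ is not in $F$, and $F$ also satisfies Property 1. Now, let $p'$ be a sequential pattern such that $f_D(p') \ge \theta + \varepsilon$. Then $f_S(p') \ge \theta + \varepsilon/2$, that is, $p' \in F$, which allows us to conclude that $F$ also has Property 2 from Definition 2. Since we know that $\sup_{p \in U} |f_D(p) - f_S(p)| \le \varepsilon/2$ with probability at least 1 − δ, the set $F$ is a FPF ε-approximation to FSP(D, θ) with probability at least 1 − δ, which concludes the proof.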
    m ← ComputeSampleSize(D, ε, δ);
    S ← sample of m transactions taken independently at random with replacement from D;
    C ← FSP(S, θ − ε/2); /* resp. θ + ε/2 to obtain a FPF ε-approximation */
    return C;

As explained above, the sample size $m$ can be computed with Algorithm 2, which uses an efficient upper bound on the VC-dimension of sequential patterns. Then, the sample is generated by taking $m$ transactions uniformly and independently at random, with replacement, from $D$. Finally, the mining of the sample $S$ can be performed with any efficient algorithm for the exact mining of frequent sequential patterns. Figure 2 depicts a block diagram representing the relations between the algorithms presented in this work.
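The steps of the sampling algorithm can be sketched in Python as follows. The sketch takes the sample size $m$ as an input rather than computing it, and it accepts any miner as a callable. The default miner `mine_single_items` is a deliberately simplistic, hypothetical stand-in that only reports patterns of item-length 1; a real run would plug in an exact frequent sequential pattern miner such as PrefixSpan. All names are assumptions of this example.

```python
import random
from collections import Counter


def mine_single_items(sample, min_freq):
    """Toy stand-in for an exact FSP miner: item-length-1 patterns only.

    Returns a dict: item -> frequency in `sample`, for items with
    frequency >= min_freq.
    """
    counts = Counter()
    for transaction in sample:
        counts.update({item for itemset in transaction for item in itemset})
    n = len(sample)
    return {item: c / n for item, c in counts.items() if c / n >= min_freq}


def sampling_fsp(dataset, m, theta, eps, mine=mine_single_items, fpf=False, seed=None):
    """Draw m transactions with replacement and mine them at a corrected threshold.

    fpf=False uses theta - eps/2 (eps-approximation);
    fpf=True  uses theta + eps/2 (false-positive-free eps-approximation).
    """
    rng = random.Random(seed)
    sample = rng.choices(dataset, k=m)  # uniform, independent, with replacement
    threshold = theta + eps / 2 if fpf else theta - eps / 2
    return mine(sample, threshold)
```

Passing the miner as a parameter mirrors the remark above that the sample can be mined with any exact algorithm; only the threshold correction and the sampling scheme are specific to the approach.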

Algorithms for True Frequent Sequential Pattern Mining
In this section, we describe our approach to find rigorous approximations to the TFSPs. In particular, given a dataset D, that is a finite bag of |D| i.i.d. samples from an unknown probability distribution π on U, a minimum frequency threshold θ and a confidence parameter δ, we aim to find rigorous approximations of the TFSPs w.r.t. θ, defined in Definitions 3 and 4, with probability at least 1 − δ.
The intuition behind our approach is the following. If we know an upper bound µ/2 on the maximum deviation, that is, $\sup_{p \in U} |t_\pi(p) - f_D(p)| \le \mu/2$, we can identify a frequency threshold $\hat{\theta}$ (resp. $\tilde{\theta}$) such that the set $FSP(D, \hat{\theta})$ is a FPF µ-approximation (resp. $FSP(D, \tilde{\theta})$ is a µ-approximation) of TFSP(π, θ). The upper bound on the maximum deviation can be computed, as illustrated in the previous sections, with the empirical VC-dimension and with the Rademacher complexity.
We now describe how to identify the threshold $\hat{\theta}$ that yields a FPF µ-approximation. Suppose that $\sup_{p \in U} |t_\pi(p) - f_D(p)| \le \mu/2$. In such a scenario, every sequential pattern $p^* \notin$ TFSP(π, θ), which therefore has $t_\pi(p^*) < \theta$, has a frequency $f_D(p^*) < \theta + \mu/2 = \hat{\theta}$. Hence, the only sequential patterns that can have frequency in $D$ greater than or equal to $\hat{\theta} = \theta + \mu/2$ are those with true frequency at least θ. The intuition is that if we find a µ such that $\sup_{p \in U} |t_\pi(p) - f_D(p)| \le \mu/2$, we know that all the sequences $p \in U$ that are not true frequent w.r.t. θ cannot be in $FSP(D, \hat{\theta})$. The following theorem formalizes the strategy to obtain a FPF µ-approximation.

Theorem 10. Given δ ∈ (0, 1), let µ be such that $\sup_{p \in U} |t_\pi(p) - f_D(p)| \le \mu/2$ with probability at least 1 − δ. Then, given θ ∈ (0, 1], the set $FSP(D, \hat{\theta})$, with $\hat{\theta} = \theta + \mu/2$, is a FPF µ-approximation of TFSP(π, θ) with probability at least 1 − δ.

Theorem 10 shows how to compute a corrected threshold $\hat{\theta}$ such that the set $FSP(D, \hat{\theta})$ is a FPF µ-approximation of TFSP(π, θ), that is, $FSP(D, \hat{\theta})$ only contains sequential patterns that are in TFSP(π, θ). It guarantees that with high probability the set $FSP(D, \hat{\theta})$ does not contain false positives, but it gives no guarantees on the number of false negatives, that is, sequential patterns that are in TFSP(π, θ) but not in $FSP(D, \hat{\theta})$. On the other hand, we might be interested in finding all the true frequent sequential patterns in TFSP(π, θ). The following result shows how to identify a threshold $\tilde{\theta}$ such that the set $FSP(D, \tilde{\theta})$ contains all the true frequent sequential patterns in TFSP(π, θ) with high probability, that is, $FSP(D, \tilde{\theta})$ is a µ-approximation of TFSP(π, θ).

Theorem 11. Given δ ∈ (0, 1), let µ be such that $\sup_{p \in U} |t_\pi(p) - f_D(p)| \le \mu/2$ with probability at least 1 − δ. Then, given θ ∈ (0, 1], the set $FSP(D, \tilde{\theta})$, with $\tilde{\theta} = \theta - \mu/2$, is a µ-approximation of TFSP(π, θ) with probability at least 1 − δ.

Note that while Theorem 11 provides guarantees on false negatives, it does not provide guarantees on the number of false positives in $FSP(D, \tilde{\theta})$.
Algorithm 7 shows the pseudo-code of the two strategies to mine the true frequent sequential patterns. To compute an upper bound on the maximum deviation, it is possible to use Algorithm 3 based on the empirical VC-dimension or the two procedures ComputeMaxDevRadeBound (Equation (34)) and ComputeMaxDevRadeApprox (Equation (40)) based on the Rademacher complexity. The mining of D can be performed with any efficient algorithm for the exact mining of frequent sequential patterns. Figure 2 shows the relations between the algorithms we presented for mining true frequent sequential patterns.
Proof. Suppose that $\sup_{p \in U} |t_\pi(p) - f_D(p)| \le \mu/2$. Thus, for all the sequential patterns $p \in U$, we have $f_D(p) \in [t_\pi(p) - \mu/2, t_\pi(p) + \mu/2]$. This also holds for the sequential patterns in $E = FSP(D, \tilde{\theta})$, with $\tilde{\theta} = \theta - \mu/2$. Therefore, the set $E$ satisfies Property 3 of Definition 3. It also means that for all $p$ ∈ TFSP(π, θ), $f_D(p) \ge \theta - \mu/2 = \tilde{\theta}$, that is, $p \in E$, which allows us to conclude that $E$ also has Property 1 from Definition 3. Now, let $p^*$ be a sequential pattern such that $t_\pi(p^*) < \theta - \mu$. Then $f_D(p^*) < \theta - \mu/2$, that is, $p^* \notin E$, which allows us to conclude that $E$ also has Property 2 from Definition 3. Since we know that $\sup_{p \in U} |t_\pi(p) - f_D(p)| \le \mu/2$ with probability at least 1 − δ, the set $E$ is a µ-approximation of TFSP(π, θ) with probability at least 1 − δ, which concludes the proof.
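The two corrected thresholds can be illustrated with a short Python sketch. It assumes that the frequencies of the candidate patterns in $D$ and a value µ (twice the bound on the maximum deviation) are already available, and it simply filters the patterns at θ + µ/2 and at θ − µ/2. The function name and the dictionary representation are hypothetical; a real implementation would instead run an exact miner on $D$ with the corrected threshold.

```python
def tfsp_approximations(freq, theta, mu):
    """Return (fpf_set, full_set) for a dict: pattern -> frequency in D.

    fpf_set  -- patterns with frequency >= theta + mu/2
                (with high probability, no false positives)
    full_set -- patterns with frequency >= theta - mu/2
                (with high probability, no false negatives)
    """
    half = mu / 2
    fpf_set = {p for p, f in freq.items() if f >= theta + half}
    full_set = {p for p, f in freq.items() if f >= theta - half}
    return fpf_set, full_set
```

By construction the first set is always contained in the second; the gap between them is the band of width µ around θ in which the data cannot decide whether a pattern is true frequent.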

Experimental Evaluation
In this section, we report the results of our experimental evaluation on multiple datasets to assess the performance of the algorithms we proposed in this work. The goals of the evaluation are the following:
• Assess the performance of our sampling algorithm. In particular, we assess whether, with probability 1 − δ, the sets of frequent sequential patterns extracted from samples are ε-approximations, for the first strategy, and FPF ε-approximations, for the second one, of FSP(D, θ). In addition, we compare the execution time of the sampling algorithm with the time required to mine the full datasets.
• Assess the performance of our algorithms for mining the true frequent sequential patterns. In particular, we assess whether, with probability 1 − δ, the set of frequent sequential patterns extracted from the dataset with the corrected threshold does not contain false positives, that is, it is a FPF µ-approximation of TFSP(π, θ), for the first method, and contains all the TFSPs, that is, it is a µ-approximation of TFSP(π, θ), for the second method. In addition, we compare the results obtained with the VC-dimension and with the Rademacher complexity, both used to compute an upper bound on the maximum deviation.
Since no sampling algorithm for rigorously approximating the set of frequent sequential patterns and no algorithm to mine true frequent sequential patterns have been previously proposed, we do not consider other methods in our experimental evaluation.

Implementation and Environment
The code to compute the bound on the VC-dimension (Algorithm 1) and to perform the evaluation has been developed in Java and executed using version 1.8.0_201. The code to compute the bound and the approximation to the Rademacher Complexity (resp. Algorithms 4 and 5) has been developed in C++. We have performed all our experiments on the same machine with 512 GB of RAM and 2 Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.3GHz. To mine sequential patterns, we used the PrefixSpan [10] implementation provided by the SPMF library [28]. We used NLopt [29] as non-linear optimization solver. Our open-source implementation and the code developed for the tests, including scripts to reproduce all results, are available online [30].

Datasets
In this section, we describe the datasets we used in our evaluation. We first describe the dataset used to evaluate our sampling algorithm for FSP mining, and then the datasets used for TFSP mining. All datasets are obtained starting from the following real datasets:
• BIBLE: a conversion of the Bible into a sequence dataset where each word is an item;
• BMS1: contains sequences of click-stream data from the e-commerce website Gazelle;
• BMS2: contains sequences of click-stream data from the e-commerce website Gazelle;
• FIFA: contains sequences of click-stream data from the website of FIFA World Cup 98;
• KOSARAK: contains sequences of click-stream data from a Hungarian news portal;
• LEVIATHAN: a conversion of the novel Leviathan by Thomas Hobbes (1651) into a sequence dataset where each word is an item;
• MSNBC: contains sequences of click-stream data from the MSNBC website, where each item represents the category of a web page;
• SIGN: contains sign language utterances.
All the datasets used are publicly available online [31], and the code to generate the pseudo-artificial datasets, as described in the following sections, is provided [30]. The characteristics of the datasets are reported in Table 1.

The typical scenario for the application of sampling is that the dataset to mine is very large, sometimes even too large to fit in the main memory of the machine. Thus, in applying sampling techniques, we aim to reduce the size of such a dataset by considering only a sample of it, in order to obtain an amount of data of reasonable size. Since the number of transactions in each real dataset (shown in Table 1) is fairly limited, we replicated each dataset to reach the size of modern datasets. For each real dataset, we fixed a replication factor and created a new dataset by replicating each transaction a number of times equal to the replication factor. The input data for the sampling algorithm is then the new enlarged dataset. The replication factors used are the following: BIBLE and FIFA = 200x; BMS1, BMS2 and KOSARAK = 100x; LEVIATHAN = 1000x; MSNBC = 10x; and SIGN = 10,000x.

TFSP Mining
To evaluate our algorithms to mine the true frequent sequential patterns, we need to know which sequential patterns are frequently generated by the unknown generative process π. In particular, we need a ground truth for the true frequencies of the sequential patterns. We generated pseudo-artificial datasets by taking some of the datasets in Table 1 as ground truth for the true frequencies $t_\pi$ of the sequential patterns. For each ground truth, we created four new datasets by sampling sequential transactions uniformly at random from the original dataset. All the new datasets have the same number of transactions as the respective ground truth, that is, the respective original dataset. We used the original datasets as ground truth and ran our evaluation on the new (sampled) datasets. Therefore, the true frequency of a sequential pattern is its frequency in the original dataset; that is, its frequency in the original dataset is exactly the frequency that the pattern would have in a hypothetical infinite number of transactions generated by the unknown generative process π.
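The generation of the pseudo-artificial datasets can be sketched in Python as follows. The sketch assumes that transactions are drawn with replacement (otherwise a sample of the same size would coincide with the original dataset); the function name, the default of four copies taken from the text, and the seed handling are choices made for this example and are not taken from the released code [30].

```python
import random


def pseudo_artificial(ground_truth, copies=4, seed=0):
    """Create `copies` datasets, each with as many transactions as the ground truth.

    Every dataset is drawn uniformly at random, with replacement (an
    assumption of this sketch), from the ground-truth dataset.
    """
    rng = random.Random(seed)
    n = len(ground_truth)
    return [rng.choices(ground_truth, k=n) for _ in range(copies)]
```

With this construction the frequency of a pattern in the ground truth plays the role of its true frequency, and each sampled dataset plays the role of the observed data.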

Sampling Algorithm Results
In this section, we describe the results obtained with our sampling algorithm (Algorithm 6). As explained above, the typical scenario in which to apply sampling is that the dataset to mine is very large. Thus, we aim to reduce the size of such a dataset by considering only a sample of it. In addition, from the sample, we aim to obtain a good approximation of the results that would have been obtained from the entire dataset. In all our experiments we fixed ε = 0.01 and δ = 0.1. The steps of the evaluation are the following (Algorithm 6): given a dataset $D_L$ as input, we compute the sample size $m$, using Algorithm 2, to obtain a 0.01-approximation (resp. FPF 0.01-approximation) with probability at least 1 − δ = 0.90. Then, we extract a random sample $S$ of $m$ transactions from $D_L$ and run the algorithm to mine the frequent sequential patterns on $S$. Finally, we verify whether the set of frequent sequential patterns extracted from the sample is a 0.01-approximation (resp. FPF 0.01-approximation) to $FSP(D_L, \theta)$. For each dataset $D_L$ we repeat the experiment 5 times, and then we compute the fraction of times the sets of frequent sequential patterns extracted from the samples have the properties described in Definition 1 (resp. Definition 2). Table 2 shows the results.

Table 2. Sampling algorithm results. For each enlarged dataset $D_L$, we report θ; the ratio $|S|/|D_L|$ between the sample size $|S|$ and the size of the enlarged dataset $|D_L|$; Max_Abs_Err, the maximum of $\max_{p \in C_i} |f_D(p) - f_{S_i}(p)|$, and Avg_Abs_Err, the average of $\max_{p \in C_i} |f_D(p) - f_{S_i}(p)|$, over the 5 samples $S_i$, with $C_i$ the set of frequent sequential patterns extracted from $S_i$; the percentage of ε-approximations obtained over the 5 samples; and the percentage of FPF ε-approximations obtained over the 5 samples.

We observe that the samples obtained from the datasets are about 2 to 5 times smaller than the whole datasets.
Moreover, in all the runs for all the datasets, we obtain an ε-approximation (resp. FPF ε-approximation). Such results are even better than the theoretical guarantees, which ensure that such approximations are obtained with probability at least 90%. We also reported $\text{Max\_Abs\_Err} = \max_{i \in [1,5]} \max_{p \in C_i} |f_D(p) - f_{S_i}(p)|$ and $\text{Avg\_Abs\_Err} = \frac{1}{5} \sum_{i=1}^{5} \max_{p \in C_i} |f_D(p) - f_{S_i}(p)|$, where $C_i$ is the set of frequent sequential patterns extracted from the sample $S_i$, $i = 1, \dots, 5$ (since we run each experiment 5 times, there are 5 samples). They represent the maximum and the average, over the 5 runs, of the maximum absolute difference between the frequency that the sequential patterns have in the entire dataset and the frequency they have in the sample, over all the sequential patterns extracted from the sample. Again, the results obtained are better than the theoretical guarantees, which ensure a maximum absolute difference of at most ε/2 = 0.005.

Figure 3 shows the comparison between the average execution time of the sampling algorithm and the average execution time of the mining of the entire dataset, over the 5 runs. For all the datasets, the sampling algorithm requires less time than the mining of the whole dataset. For BMS1 and BMS2, the mining of the whole dataset is very fast, since the number of frequent sequential patterns extracted from it is low. Thus, there is not a large difference between the execution time to mine the whole dataset and the execution time of the sampling algorithm, which is mostly due to the computation of the sample size. Similar results for our sampling algorithm and the mining of the whole dataset have also been obtained with KOSARAK and MSNBC.
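The two error statistics can be computed as in the following Python sketch, which assumes that the exact frequencies in the full dataset and the frequencies measured in each sample are available as dictionaries. The function name and the data layout are hypothetical.

```python
def abs_err_stats(freq_full, sample_results):
    """Return (Max_Abs_Err, Avg_Abs_Err) over a list of runs.

    freq_full      -- dict: pattern -> frequency in the full dataset
    sample_results -- list of dicts, one per run: pattern -> frequency in that sample
                      (each dict is assumed non-empty and to contain only
                      patterns present in freq_full)
    """
    per_run = [
        max(abs(freq_full[p] - f) for p, f in result.items())
        for result in sample_results
    ]
    return max(per_run), sum(per_run) / len(per_run)
```

Each run contributes its worst-case deviation over the patterns it reported; the two statistics are then the maximum and the mean of these per-run values.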
As expected, for all the datasets, the execution time of the sampling algorithm to obtain an ε-approximation is larger than the execution time of the sampling algorithm to obtain a FPF ε-approximation, since the minimum frequency threshold used in the first case is lower, resulting in a higher number of extracted sequential patterns.

We now discuss some of the patterns extracted from the MSNBC dataset, for which richer information regarding the data is available. In particular, in MSNBC each transaction contains the sequence of click-stream data generated by a single view on the MSNBC website by a user, and each item represents the category of a visited webpage, such as "frontpage", "news", "sports", and so forth.
The two most frequent sequential patterns extracted from the enlarged datasets with a classic FSP algorithm are single categories, that is, sequential patterns of item-length 1: ⟨{frontpage}⟩ is the most frequent, while ⟨{on-air}⟩ is the second. They are also the two most frequent sequential patterns extracted in all five samples using our sampling algorithms. The most frequent sequential patterns with item-length greater than one are ⟨{frontpage}, {frontpage}⟩ and ⟨{frontpage}, {frontpage}, {frontpage}⟩. For ⟨{frontpage}, {frontpage}⟩, in 75% of the transactions in which it appears there is at least one instance of the pattern where the two items are consecutive. This means that users visited two consecutive webpages of the same category, "frontpage", or that they refreshed the same page twice, while in the remaining 25% of the transactions in which it appears users visited webpages of other categories between the two "frontpage" webpages. Instead, for ⟨{frontpage}, {frontpage}, {frontpage}⟩ the percentage of transactions in which the three items are consecutive is 59%. We also observed similar results with other categories: sequential patterns that are sequences of the same item, and so of the same category, have higher frequency. This highlights that users tend to visit pages of the same category consecutively, or that they refresh the same pages multiple times.
The most frequent sequential patterns that are not sequences of the same item are combinations of the items "frontpage" and "news", for example, ⟨{frontpage}, {news}⟩, ⟨{frontpage}, {news}, {news}⟩ and ⟨{news}, {frontpage}⟩. Surprisingly, the item "on-air" alone is more frequent than the item "news" alone. This means that users visit "news" webpages coming from a "frontpage" more frequently than "on-air" webpages, although overall they visit "on-air" webpages more frequently.

True Frequent Sequential Patterns Results
In this section, we describe the results of our algorithms for mining the true frequent sequential patterns. In all these experiments, we fixed $\delta = 0.1$. First of all, for each real dataset we generated 4 pseudo-artificial datasets $D_i$, $i \in [1, 4]$, from the same ground truth. We mined the set $FSP(D_i, \theta)$, and we compared it with the TFSPs, that is, the set $FSP(D, \theta)$, where $D$ is the ground truth. These experiments aim to verify whether the sets of FSPs extracted from the pseudo-artificial datasets contain false positives and miss some TFSPs. Table 3 shows the fractions of times that the set $FSP(D_i, \theta)$ contains false positives and misses TFSPs from the ground truth. We ran this evaluation over the four datasets $D_i$, $i \in [1, 4]$, of the same size from the same ground truth, and we reported the average. For each dataset, we report the results with two frequency thresholds $\theta$. In almost all the cases, the FSPs mined from the pseudo-artificial datasets contain false positives and miss some TFSPs. In particular, with lower frequency thresholds (and, therefore, a larger number of patterns), the fraction of times we find false positives and false negatives usually increases. These results emphasize that, in general, the mining of the FSPs is not enough to learn interesting features of the underlying generative process of the data, and techniques like the ones introduced in this work are necessary. Then, we computed and compared the upper bounds to the maximum deviation introduced in the previous sections, since our strategy to find an approximation to the true frequent sequential patterns hinges on finding a tight upper bound to the maximum deviation.
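The comparison between the patterns mined from a pseudo-artificial dataset and the TFSPs of the ground truth can be sketched as follows. The sketch assumes that pattern frequencies have already been computed and that a pattern is frequent when its frequency is at least θ. The function names and the toy frequencies are hypothetical and are not taken from the experiments.

```python
def frequent_patterns(freqs, theta):
    """Patterns whose frequency is at least `theta` (assumed definition of 'frequent')."""
    return {p for p, f in freqs.items() if f >= theta}


def compare_with_ground_truth(sample_freqs, true_freqs, theta):
    """Return (false positives, false negatives) of the mined set w.r.t. the ground truth."""
    mined = frequent_patterns(sample_freqs, theta)
    truth = frequent_patterns(true_freqs, theta)
    return mined - truth, truth - mined


# Hypothetical frequencies: the same three patterns in the ground truth
# and in one pseudo-artificial dataset drawn from it.
true_freqs = {("frontpage",): 0.030, ("news",): 0.021, ("sports",): 0.019}
sample_freqs = {("frontpage",): 0.031, ("news",): 0.018, ("sports",): 0.022}

false_pos, false_neg = compare_with_ground_truth(sample_freqs, true_freqs, theta=0.02)
print(false_pos, false_neg)
```

Patterns whose true frequency lies close to θ are the ones that random fluctuations can push across the threshold in either direction, which is why plain FSP mining on a sample of the generative process gives no guarantee on either kind of error.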
For each pseudo-artificial dataset, we computed the upper bound $\mu_{VC}/2$ to the maximum deviation using the VC-dimension based bound (ComputeMaxDevVC, Algorithm 3), the Rademacher complexity based bound $\mu_R^b/2$ (ComputeMaxDevRadeBound, Equation (34)), and the Rademacher complexity approximation $\mu_R^a/2$ (ComputeMaxDevRadeApprox, Equation (40)). Table 4 shows that the two methods for computing the upper bound to the maximum deviation using an upper bound to the empirical VC-dimension and to the Rademacher complexity are similar for BMS1 and BMS2, but for the other samples the VC-dimension-based algorithm is better than the one based on the Rademacher complexity bound by a factor between 2 and 3, that is, $\mu_R^b/\mu_{VC} \in [2, 3]$. Tighter upper bounds to the maximum deviation are provided by the method that uses the approximation of the Rademacher complexity. In our implementation of Algorithm 4 to compute an upper bound to the empirical Rademacher complexity of sequential patterns, we compute several upper bounds associated with different integer values of $\eta \in [\eta + \beta_1, \min(\beta_2, \eta)]$ for fixed values of $\beta_1$ and $\beta_2$, taking the minimum bound among those computed. In our experiments, we fixed $\beta_1 = 20$ and $\beta_2 = 120$. In practice, by increasing the value of $\eta$ we observe a decreasing trend of the upper bound value until a minimum value is reached. Then, by further increasing the value of $\eta$, the value of the upper bound increases until it converges to the one achieved with $\eta = \eta$. In addition, for each pseudo-artificial dataset the value of $\eta$ associated with the minimum value of the upper bound to the maximum deviation is always found in $[\eta + \beta_1, \min(\beta_2, \eta)]$, with $\beta_1 = 20$, $\beta_2 = 120$.
Finally, we evaluated the performance of our two strategies to mine an approximation of the true frequent sequential patterns, the first one with guarantees on the false positives and the second one with guarantees on the false negatives, using the upper bounds on the maximum deviation computed above. We considered the two tightest upper bounds, that is, $\mu_{VC}/2$ and $\mu_R^a/2$, computed respectively using the empirical VC-dimension and an approximation of the empirical Rademacher complexity. From each pseudo-artificial dataset, we mined the FSPs using the corrected frequency threshold $\theta + \mu/2$ for the first strategy and $\theta - \mu/2$ for the second one, computed using Theorems 10 and 11, respectively, and we compared the sequential patterns extracted with the TFSPs from the ground truth. Table 5 shows the results for the strategy with guarantees on the false positives. Using $\mu_{VC}/2$ to compute the corrected frequency threshold $\theta + \mu_{VC}/2$, our algorithm performs better than the theoretical guarantees in all the runs, since the number of times the output contains false positives is always equal to zero, while the theory guarantees a probability of at least $1 - \delta = 0.9$ to obtain the correct approximation. Obviously, this also happens using $\mu_R^a/2$ to compute the corrected frequency threshold $\theta + \mu_R^a/2$, since $\mu_{VC} > \mu_R^a$. We also computed the average fraction of TFSPs reported in the output by the algorithm, that is, $|FSP(D_i, \theta + \mu/2)| / |TFSP|$, since we aim to obtain as many TFSPs as possible. For all the datasets, the results obtained with the Rademacher complexity are better than the ones obtained with the VC-dimension, since the Rademacher complexity yields a higher percentage of TFSPs in the output. Table 6 shows the results for the strategy with guarantees on the false negatives.
Similar to the previous case, our algorithm performs better than the theoretical guarantees in all the runs, since the number of times the algorithm misses some TFSPs is always equal to zero, with both the VC-dimension and the Rademacher complexity based results. We also report the average fraction of patterns in the output that are TFSPs, that is, $|TFSP| / |FSP(D_i, \theta - \mu/2)|$, since we are interested in obtaining all the TFSPs with as few false positives as possible. Again, the results with the Rademacher complexity are better than the ones obtained with the VC-dimension, since the fraction of sequential patterns in the output of the algorithm that are TFSPs is higher using the Rademacher complexity.

Table 5. Results of our algorithm for the TFSPs with guarantees on the false positives in 4 pseudo-artificial datasets $D_i$ for each ground truth. The table reports the frequency thresholds $\theta$ used in the experiments, the number of TFSPs in the ground truth, the number of times the output contains false positives using $\theta + \mu_{VC}/2$ as frequency threshold and the average fraction of the reported TFSPs in the output using such frequency threshold, the number of times the output contains false positives using $\theta + \mu_R^a/2$ and the average fraction of the reported TFSPs in the output using such frequency threshold.

Table 6. Results of our algorithm for the TFSPs with guarantees on the false negatives in 4 pseudo-artificial datasets $D_i$ for each ground truth. The table reports the frequency thresholds $\theta$ used in the experiments, the number of TFSPs in the ground truth, the number of times the output of the algorithm misses some TFSPs using $\theta - \mu_{VC}/2$ as frequency threshold and the average fraction of sequential patterns that are TFSPs in the output using such frequency threshold, the number of times the output of the algorithm misses some TFSPs using $\theta - \mu_R^a/2$ and the average fraction of sequential patterns that are TFSPs in the output using such frequency threshold.

We now briefly analyze the sequential patterns extracted from the MSNBC dataset using our TFSP algorithms. Since we considered the FSPs extracted from the whole dataset as ground truth, that is, as TFSPs, the considerations reported for the most frequent sequential patterns extracted from the whole dataset and from the samples (see previous section) are still valid for the true frequent sequential patterns that have higher frequency.
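The two corrected-threshold strategies described in the captions of Tables 5 and 6 can be illustrated with the following sketch. It assumes that the pattern frequencies in a dataset $D_i$ are already available and that a pattern is reported when its frequency is at least the corrected threshold. The function name, the `guarantee` labels, the toy frequencies, and the two values of µ are hypothetical.

```python
def mine_with_correction(freqs, theta, mu, guarantee):
    """Report patterns above a threshold shifted by mu/2.

    guarantee="false_positives": threshold theta + mu/2 (stricter, fewer patterns).
    guarantee="false_negatives": threshold theta - mu/2 (looser, more patterns).
    """
    if guarantee == "false_positives":
        threshold = theta + mu / 2
    elif guarantee == "false_negatives":
        threshold = theta - mu / 2
    else:
        raise ValueError(f"unknown guarantee: {guarantee!r}")
    return {p for p, f in freqs.items() if f >= threshold}


# Hypothetical frequencies of four patterns in one pseudo-artificial dataset.
freqs = {
    ("frontpage",): 0.035,
    ("frontpage", "news"): 0.024,
    ("news",): 0.016,
    ("sports",): 0.010,
}
theta = 0.02
mu_loose, mu_tight = 0.010, 0.006  # stand-ins for a looser and a tighter bound

fpf_loose = mine_with_correction(freqs, theta, mu_loose, "false_positives")
fpf_tight = mine_with_correction(freqs, theta, mu_tight, "false_positives")
fnf_loose = mine_with_correction(freqs, theta, mu_loose, "false_negatives")
fnf_tight = mine_with_correction(freqs, theta, mu_tight, "false_negatives")
print(len(fpf_loose), len(fpf_tight), len(fnf_loose), len(fnf_tight))
```

A tighter bound moves both corrected thresholds closer to θ, so the first strategy reports more patterns and the second reports fewer. This is the mechanism behind the better fractions obtained with the approximation of the Rademacher complexity.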

Using θ = 0.02, as shown in Tables 5 and 6, we find 97 true frequent sequential patterns. In the four pseudo-artificial datasets we extracted on average ≈126 and ≈230 sequential patterns with guarantees on the false negatives, using respectively the approximation of the Rademacher complexity and the VC-dimension. With the algorithms with guarantees on the false positives, we mined ≈74 and ≈54 sequential patterns, respectively.
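As a rough cross-check of these counts, the ratios below relate the average output sizes to the 97 TFSPs. They assume, as reported for Tables 5 and 6, that no run produced false positives under the first strategy and that no run missed a TFSP under the second. Since they are ratios of rounded averages, they may differ slightly from the averaged fractions reported in the tables.

```latex
% Strategy with guarantees on the false positives: fraction of TFSPs reported
\frac{74}{97} \approx 0.76 \quad \text{(Rademacher approximation)}, \qquad
\frac{54}{97} \approx 0.56 \quad \text{(VC-dimension)}

% Strategy with guarantees on the false negatives: fraction of the output that is a TFSP
\frac{97}{126} \approx 0.77 \quad \text{(Rademacher approximation)}, \qquad
\frac{97}{230} \approx 0.42 \quad \text{(VC-dimension)}
```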
is the most frequent sequential pattern that is a TFSP but is not returned by our algorithm with guarantees on the false positives using the VC-dimension in any of the four pseudo-artificial datasets, that is, it is one of the allowed false negatives. Instead, the corresponding algorithm that uses the approximation of the Rademacher complexity always returned this sequential pattern as a TFSP. The most frequent sequential patterns that are true frequent but that are not returned by our algorithm with guarantees on the false positives using the approximation of the Rademacher complexity are

Discussion
In this work, we studied two tasks related to sequential pattern mining: frequent sequential pattern mining and true frequent sequential pattern mining. For both tasks, we defined rigorous approximations and designed efficient algorithms to extract such approximations with high confidence using advanced tools from statistical learning theory. In particular, we devised an efficient sampling-based algorithm to approximate the set of frequent sequential patterns in large datasets using the concept of VC-dimension. We also devised efficient algorithms to mine the true frequent sequential patterns using VC-dimension and Rademacher complexity. Our extensive experimental evaluation shows that our sampling algorithm for mining frequent sequential patterns produces accurate approximations using samples that are small fractions of the whole datasets, thus vastly speeding up the sequential pattern mining task on very large datasets. For mining true frequent sequential patterns, our experimental evaluation shows that our algorithms obtain high-quality approximations, even better than guaranteed by their theoretical analysis. In addition, our evaluation shows that the upper bound on the maximum deviation computed using the approximation of the Rademacher complexity yields better results than the upper bound computed using the empirical VC-dimension.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Missing Proofs
In this appendix we present the proofs not included in the main text.
Theorem 3. Let $S$ be a random sample of $m$ transactions taken with replacement from the sequential dataset $D$ and $\varepsilon, \delta \in (0, 1)$. Let $d$ be the s-bound of $D$. If m ≥ 2

This implies that $p_1$ and $p_2$ are closed sequences in $D$ and, since their support set is $W$, they belong to $CS_W(D)$. Thus, the case $C = 2$ could happen.

Now we generalize this concept to a generic number $C$ of closed sequential patterns, where $2 \le C \le |CS(D)|$. Let $H = \{p_1, p_2, \dots, p_C\}$ be a set of $C$ sequential patterns with $W$ as support set. Assume that each super-sequence of $p_i$ but not of $p_k$ has support lower than the support of $p_i$, for each $i, k \in [1, \dots, C]$ with $k \neq i$. Let $H_p$ be the power set of $H$ without the empty set and the sets made of only one sequence, that is, $H_p = P(H) \setminus \{\emptyset, \{p_1\}, \{p_2\}, \dots, \{p_C\}\}$. So, $H_p$ contains every possible subset of $H$ of size greater than one. For a transaction $\tau \in W$ and $h_p \in H_p$, we define $y_\tau(h_p) = \tau_{h_p}$ as the subsequence of $\tau$ restricted to $h_p$, that is, to only the sequences $p \in h_p$, preserving the relative order of their itemsets. If for all $h_p \in H_p$ there exists a transaction $\tau \in W$ such that the support set of $y_\tau(h_p)$ in $W$ does not coincide with $W$, that is, $T_W(y_\tau(h_p)) \subset W$, then for each transaction $\tau \in W$ we have $|T_W(y_\tau(h_p))| < |T_W(p_1)| = \dots = |T_W(p_C)| = |W|$. Hence each super-sequence made of only sequences of $h_p$ has support lower than the support of $p_i$, for $i = 1, \dots, C$. Thus, each super-sequence of $p_i$ has a lower support compared to the support of $p_i$, for $i = 1, \dots, C$. This implies that all sequences of $H$ are closed sequences in $D$ and, since their support set is $W$, they belong to $CS_W(D)$.

Let $p \in U$ be a sequence with non-empty support set in $D$, that is, $v_D(p) \neq (0, \dots, 0)$. There are two possibilities: $p$ is or is not a closed sequence in $D$. If $p$ is not a closed sequence, then there exists a closed super-sequence $y$ of $p$ with support equal to the support of $p$, so with $v_D(p) = v_D(y)$. Thus, $v_D(p)$ is associated with at least one closed sequence. Combining this with Lemma 1 and the fact that each vector $v \in V_D$ is associated with at least one sequence $p \in U$, each vector of $V_D$ different from $(0, \dots, 0)$ is associated with at least one closed sequential pattern of $D$. To conclude our proof, it is sufficient to show that there are no closed sequences associated with the vector $(0, \dots, 0)$. Let $SP_\infty = \{p \in U : v_D(p) = (0, \dots, 0)\}$. Note that $|SP_\infty| = \infty$. For each $p \in SP_\infty$, there always exists a super-sequence $y$ of $p$ such that $f_D(p) = f_D(y) = 0$. This implies that each sequence of $SP_\infty$ is not closed. Thus, $\{v_D(p) : p \in CS(D)\} = V_D \setminus \{(0, \dots, 0)\}$ and $|V_D| = |V_D \setminus \{(0, \dots, 0)\}| + 1 \le |CS(D)| + 1$.
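The bound $|V_D| \le |CS(D)| + 1$ can be checked by brute force on a small example. The sketch below is only an illustration under simplifying assumptions: every itemset contains a single item, the dataset is a hypothetical toy one, and all function names are introduced here. Candidate patterns are restricted to subsequences of the transactions, which are exactly the patterns with non-empty support.

```python
from itertools import combinations


def subsequences(transaction):
    """All non-empty subsequences of a transaction, as tuples."""
    n = len(transaction)
    return {
        tuple(transaction[i] for i in idx)
        for r in range(1, n + 1)
        for idx in combinations(range(n), r)
    }


def is_subsequence(p, t):
    it = iter(t)
    return all(x in it for x in p)


def support_vector(p, dataset):
    """Binary vector v_D(p): entry j is 1 if p is a subsequence of transaction j."""
    return tuple(int(is_subsequence(p, t)) for t in dataset)


def closed_patterns(dataset):
    """Return the closed patterns and the support vector of every supported pattern."""
    candidates = set().union(*(subsequences(t) for t in dataset))
    vec = {p: support_vector(p, dataset) for p in candidates}
    closed = set()
    for p in candidates:
        # p is not closed if a proper super-sequence has the same support set.
        has_equal_super = any(
            len(q) > len(p) and vec[q] == vec[p] and is_subsequence(p, q)
            for q in candidates
        )
        if not has_equal_super:
            closed.add(p)
    return closed, vec


toy = [("a", "b", "c"), ("a", "c"), ("b", "c", "a")]
closed, vec = closed_patterns(toy)
nonzero_vectors = set(vec.values())     # vectors of patterns with non-empty support
size_V_D = len(nonzero_vectors) + 1     # plus the all-zero vector
print(size_V_D, len(closed) + 1)
```

In this example every non-zero vector is attained by at least one closed pattern, which is the step of the argument that yields the inequality.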

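Lemma 3. We have

$$|C_{a,\tau}| \le 2^{k_{a,\tau} - m_{a,\tau}} \left(2^{m_{a,\tau}} - 1\right). \tag{A6}$$

Proof. $C_{a,\tau}$ represents a subset of the set $\Phi$ of all those subsequences of $\tau$ that are made of only items equal to $a$ or that come after $a$ in $<_o$, with item-length at least two and with at least one occurrence of $a$. Let us focus on finding an upper bound to $|\Phi|$. In order to build such a generic subsequence of $\tau$, it is sufficient to select $i$ occurrences of $a$ among the $m_{a,\tau}$ available, with $1 \le i \le m_{a,\tau}$, and choose $j$ items among the remaining $k_{a,\tau} - m_{a,\tau}$ items different from $a$. Note that if $i = 1$, then $j$ must be greater than 0. Thus, using the fact that the sum of $\binom{n}{k}$ for $k = 0, \dots, n$ is equal to $2^n$, we have

$$|\Phi| \le \binom{m_{a,\tau}}{1} \sum_{j=1}^{k_{a,\tau} - m_{a,\tau}} \binom{k_{a,\tau} - m_{a,\tau}}{j} + \sum_{i=2}^{m_{a,\tau}} \binom{m_{a,\tau}}{i} \sum_{j=0}^{k_{a,\tau} - m_{a,\tau}} \binom{k_{a,\tau} - m_{a,\tau}}{j} = 2^{k_{a,\tau} - m_{a,\tau}} \left(2^{m_{a,\tau}} - 1\right) - m_{a,\tau} \le 2^{k_{a,\tau} - m_{a,\tau}} \left(2^{m_{a,\tau}} - 1\right),$$

where the first inequality holds because some sequences of $\Phi$ are counted more than once. Since $|C_{a,\tau}| \le |\Phi|$, the thesis holds.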
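The counting argument in the proof of Lemma 3 can be checked numerically. The sketch below counts the selections described in the proof ($i$ occurrences of $a$ out of $m_{a,\tau}$, and $j$ of the remaining $k_{a,\tau} - m_{a,\tau}$ items, with $j \ge 1$ when $i = 1$) and compares the total with the right-hand side of (A6). The function names are hypothetical, and `k` and `m` stand for $k_{a,\tau}$ and $m_{a,\tau}$.

```python
from math import comb


def count_selections(k, m):
    """Number of ways to pick i >= 1 occurrences of a and j other items, with j >= 1 if i == 1."""
    rest = k - m
    total = 0
    for i in range(1, m + 1):
        j_min = 1 if i == 1 else 0  # item-length must be at least two
        total += comb(m, i) * sum(comb(rest, j) for j in range(j_min, rest + 1))
    return total


def lemma3_bound(k, m):
    """Right-hand side of (A6): 2^(k - m) * (2^m - 1)."""
    return 2 ** (k - m) * (2 ** m - 1)


# Exhaustive check on small values with 1 <= m <= k.
checked = [
    (k, m, count_selections(k, m), lemma3_bound(k, m))
    for k in range(1, 13)
    for m in range(1, k + 1)
]
print(all(count <= bound for _, _, count, bound in checked))
```

For these values the count equals the bound minus `m`, so the bound in (A6) is loose only by the `m` selections that consist of a single occurrence of $a$ and nothing else.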
where $C_{a,\tau}(\eta)$ is the set of closed sequential patterns of $C_{a,\tau}$ with item-length in $[2, \eta]$. Since the set of items of the sequences in $C_1$ is a subset of $I$, we have
For any $p \in C_{a,\tau}(\eta)$, $f_D(p) \le f_D(\langle \{a\} \rangle)$ by the anti-monotonicity support property for sequential patterns. An upper bound to the size of $C_{a,\tau}(\eta)$ can be computed in two ways, depending on the value of $k_{a,\tau}$. If $k_{a,\tau} \le \eta$, we can use Lemma 3: $2^{k_{a,\tau} - m_{a,\tau}} \left(2^{m_{a,\tau}} - 1\right) \exp$
If $k_{a,\tau} > \eta$, we have to count the number of possible closed sequential patterns with at least one item equal to $a$ and with item-length in $[2, \eta]$ that we can build from $k_{a,\tau}$ items of $\tau$:
Finally, using the quantities $\chi$, $k$, $m$ and $g$ previously defined and indicator functions we can merge the right-hand sides of the last two inequalities