# Mining Sequential Patterns with VC-Dimension and Rademacher Complexity

## Abstract

## 1. Introduction

#### 1.1. Our Contributions

- We define rigorous approximations of the set of frequent sequential patterns and the set of true frequent sequential patterns. In particular, for both sets we define two approximations: one with no false negatives, that is, containing all elements of the set; and one with no false positives, that is, without any element that is not in the set. Our approximations are defined in terms of a single parameter, which controls the accuracy of the approximation and is easily interpretable.
- We study the VC-dimension and the Rademacher complexity of sequential patterns, two concepts from statistical learning theory that have been used in other mining contexts, and provide algorithms to efficiently compute upper bounds for both. In particular, we provide a simple upper bound on the VC-dimension of sequential patterns, still effective in practice, obtained by relaxing the upper bound previously defined in Reference [8]. We also provide the first efficiently computable upper bound on the Rademacher complexity of sequential patterns, and show how to approximate this complexity.
- We introduce a new sampling-based algorithm to identify rigorous approximations of the frequent sequential patterns with probability at least $1-\delta $, where $\delta $ is a confidence parameter set by the user. Our algorithm hinges on our novel bound on the VC-dimension of sequential patterns, and it makes it possible to obtain a rigorous approximation of the frequent sequential patterns by mining only a fraction of the whole dataset.
- We introduce efficient algorithms to obtain rigorous approximations of the true frequent sequential patterns with probability at least $1-\delta $, where $\delta $ is a confidence parameter set by the user. Our algorithms use the novel bounds on the VC-dimension and on the Rademacher complexity that we have derived, and they yield accurate approximations of the true frequent sequential patterns, where the accuracy depends on the size of the available data.
- We perform an extensive experimental evaluation analyzing several sequential datasets, showing that our algorithms provide high-quality approximations, even better than guaranteed by their theoretical analysis, for both tasks we consider.

#### 1.2. Related Work

## 2. Preliminaries

#### 2.1. Sequential Pattern Mining

**Example 1.**

#### 2.1.1. Frequent Sequential Pattern Mining

**Definition 1.** An $\epsilon $-approximation $\mathcal{C}$ of $FSP(\mathcal{D},\theta )$ is a set of pairs $(p,{f}_{p})$ with the following properties:

- $\mathcal{C}$ contains a pair $(p,{f}_{p})$ for every $(p,{f}_{\mathcal{D}}\left(p\right))\in FSP(\mathcal{D},\theta )$;
- $\mathcal{C}$ contains no pair $(p,{f}_{p})$ such that ${f}_{\mathcal{D}}\left(p\right)<\theta -\epsilon $;
- for every $(p,{f}_{p})\in \mathcal{C}$, it holds $|{f}_{\mathcal{D}}\left(p\right)-{f}_{p}|\le \epsilon /2$.

**Definition 2.** A false positives free (FPF) $\epsilon $-approximation $\mathcal{F}$ of $FSP(\mathcal{D},\theta )$ is a set of pairs $(p,{f}_{p})$ with the following properties:

- $\mathcal{F}$ contains no pair $(p,{f}_{p})$ such that ${f}_{\mathcal{D}}\left(p\right)<\theta $;
- $\mathcal{F}$ contains all the pairs $(p,{f}_{p})$ such that ${f}_{\mathcal{D}}\left(p\right)\ge \theta +\epsilon $;
- for every $(p,{f}_{p})\in \mathcal{F}$, it holds $|{f}_{\mathcal{D}}\left(p\right)-{f}_{p}|\le \epsilon /2$.
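The conditions listed in Definitions 1 and 2 can be checked mechanically when the exact frequencies are available. The following is a minimal Python sketch under stated assumptions: patterns are hashable keys, `exact` maps each pattern with non-zero frequency to ${f}_{\mathcal{D}}\left(p\right)$, `approx` maps each reported pattern to ${f}_{p}$, and a pattern is frequent when ${f}_{\mathcal{D}}\left(p\right)\ge \theta $. The function names and the toy numbers are illustrative and do not come from the authors' implementation.

```python
def _frequencies_accurate(approx, exact, eps):
    # Third condition of both definitions: |f_D(p) - f_p| <= eps / 2.
    return all(abs(exact.get(p, 0.0) - f_p) <= eps / 2 for p, f_p in approx.items())


def is_eps_approximation(approx, exact, theta, eps):
    """Definition 1: no false negatives, nothing with frequency below theta - eps."""
    has_all_frequent = all(p in approx for p, f in exact.items() if f >= theta)
    no_far_infrequent = all(exact.get(p, 0.0) >= theta - eps for p in approx)
    return has_all_frequent and no_far_infrequent and _frequencies_accurate(approx, exact, eps)


def is_fpf_eps_approximation(approx, exact, theta, eps):
    """Definition 2: no false positives, everything with frequency >= theta + eps."""
    no_infrequent = all(exact.get(p, 0.0) >= theta for p in approx)
    has_all_clear = all(p in approx for p, f in exact.items() if f >= theta + eps)
    return no_infrequent and has_all_clear and _frequencies_accurate(approx, exact, eps)


# Hypothetical exact frequencies and two candidate outputs.
exact = {"a": 0.50, "b": 0.32, "c": 0.27, "d": 0.10}
theta, eps = 0.3, 0.1
with_c = {"a": 0.52, "b": 0.30, "c": 0.25}   # keeps the borderline pattern "c"
only_a = {"a": 0.52}                          # keeps only the clearly frequent pattern

print(is_eps_approximation(with_c, exact, theta, eps))      # True
print(is_fpf_eps_approximation(with_c, exact, theta, eps))  # False: "c" is a false positive
print(is_eps_approximation(only_a, exact, theta, eps))      # False: "b" is missing
print(is_fpf_eps_approximation(only_a, exact, theta, eps))  # True
```

The toy example shows that the two notions are not interchangeable: a set can satisfy one definition and violate the other.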

#### 2.1.2. True Frequent Sequential Pattern Mining

**Definition 3.** A $\mu $-approximation $\mathcal{E}$ of $TFSP(\pi ,\theta )$ is a set of pairs $(p,{f}_{p})$ with the following properties:

- $\mathcal{E}$ contains a pair $(p,{f}_{p})$ for every $(p,{t}_{\pi}\left(p\right))\in TFSP(\pi ,\theta )$;
- $\mathcal{E}$ contains no pair $(p,{f}_{p})$ such that ${t}_{\pi}\left(p\right)<\theta -\mu $;
- for every $(p,{f}_{p})\in \mathcal{E}$, it holds $|{t}_{\pi}\left(p\right)-{f}_{p}|\le \mu /2$.

**Definition 4.** A false positives free (FPF) $\mu $-approximation $\mathcal{G}$ of $TFSP(\pi ,\theta )$ is a set of pairs $(p,{f}_{p})$ with the following properties:

- $\mathcal{G}$ contains no pair $(p,{f}_{p})$ such that ${t}_{\pi}\left(p\right)<\theta $;
- $\mathcal{G}$ contains all the pairs $(p,{f}_{p})$ such that ${t}_{\pi}\left(p\right)\ge \theta +\mu $;
- for every $(p,{f}_{p})\in \mathcal{G}$, it holds $|{t}_{\pi}\left(p\right)-{f}_{p}|\le \mu /2$.

#### 2.2. VC-Dimension

**Definition 5.**

**Example 2.**

**Definition 6.**

**Theorem 1.**

#### 2.3. Rademacher Complexity

#### 2.4. Maximum Deviation

## 3. VC-Dimension of Sequential Patterns

**Definition 7.**

- $X=\mathcal{D}$ is the set of sequential transactions in the dataset;
- $\mathcal{R}=\{{T}_{\mathcal{D}}\left(p\right):p\in \mathbb{U}\}$ is a family of sets of sequential transactions such that for each sequential pattern $p$, the set ${T}_{\mathcal{D}}\left(p\right)=\{\tau \in \mathcal{D}:p\sqsubseteq \tau \}$ is the support set of $p$ on $\mathcal{D}$.
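The support set ${T}_{\mathcal{D}}\left(p\right)$ can be computed directly from the subsequence relation. The sketch below assumes the standard notion of containment for sequences of itemsets: $p\sqsubseteq \tau $ when the itemsets of $p$ can be matched, in order, to itemsets of $\tau $ that contain them. Transactions are represented as lists of Python sets, and the support set is returned as a list of transaction indices because $\mathcal{D}$ may contain repeated transactions. The helper names and the toy dataset are hypothetical.

```python
def is_subsequence(p, tau):
    """Return True if pattern p is a subsequence of transaction tau.

    Both arguments are sequences of sets. Greedy leftmost matching is enough:
    matching each itemset of p as early as possible never rules out a later match.
    """
    pos = 0
    for itemset in p:
        # Advance until an itemset of tau contains the current itemset of p.
        while pos < len(tau) and not itemset <= tau[pos]:
            pos += 1
        if pos == len(tau):
            return False
        pos += 1  # the next itemset of p must match strictly later in tau
    return True


def support_set(p, dataset):
    """Indices of the transactions of the dataset that contain p."""
    return [i for i, tau in enumerate(dataset) if is_subsequence(p, tau)]


def frequency(p, dataset):
    """Fraction of transactions of the dataset that contain p."""
    return len(support_set(p, dataset)) / len(dataset)


# Toy dataset with three sequential transactions.
D = [
    [{1, 2}, {3}],
    [{1}, {2}, {3}],
    [{3}, {1, 2}],
]
print(support_set([{1}, {3}], D))   # [0, 1]
print(support_set([{1, 2}], D))     # [0, 2]
print(frequency([{1}, {3}], D))
```

Each distinct support set obtained this way is one range of $\mathcal{R}$; different patterns may share the same range.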

**Example 3.**

**Definition 8.**

**Theorem 2.** Let $\mathcal{D}$ be a sequential dataset with s-index $s$. Then, the range space $RS=(X,\mathcal{R})$ corresponding to $\mathcal{D}$ has VC-dimension $\le s$.

**Definition 9.**

Algorithm 1: SBoundUpp($\mathcal{D}$): computation of an upper bound on the s-bound.

#### 3.1. Compute the Sample Size for Frequent Sequential Pattern Mining

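**Theorem 3.** Let $S$ be a random sample of $m$ transactions taken with replacement from the sequential dataset $\mathcal{D}$ and $\epsilon ,\delta \in (0,1)$. Let $d$ be the s-bound of $\mathcal{D}$. If

$$m\ge \frac{2}{{\epsilon}^{2}}\left(d+\ln \frac{1}{\delta }\right),$$

then, with probability $\ge 1-\delta $, $\sup_{p\in \mathbb{U}}|{f}_{\mathcal{D}}\left(p\right)-{f}_{S}\left(p\right)|\le \epsilon /2$.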

Algorithm 2: ComputeSampleSize($\mathcal{D},\epsilon ,\delta $): computation of the sample size such that $\sup_{p\in \mathbb{U}}|{f}_{\mathcal{D}}\left(p\right)-{f}_{S}\left(p\right)|\le \epsilon /2$ with probability $\ge 1-\delta $.

Data: Dataset $\mathcal{D}$; $\epsilon ,\delta \in (0,1)$.

Result: The sample size $m$.

- $d\leftarrow $ SBoundUpp($\mathcal{D}$);
- $m\leftarrow \frac{2}{{\epsilon}^{2}}\left(d+\ln (1/\delta )\right)$;
- return $m$;
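The sample-size formula of Algorithm 2 is simple enough to evaluate by hand or in a few lines of code. The Python sketch below takes the upper bound $d$ as an input, since the body of SBoundUpp is not reproduced here. Rounding the result up to an integer is an assumption of this sketch (a sample must contain a whole number of transactions), and the numeric values are hypothetical.

```python
import math


def compute_sample_size(d, eps, delta):
    """Sample size m = (2 / eps^2) * (d + ln(1 / delta)), rounded up.

    d is an upper bound on the s-bound of the dataset (output of SBoundUpp).
    """
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie in (0, 1)")
    return math.ceil(2.0 / eps**2 * (d + math.log(1.0 / delta)))


# Hypothetical values: d = 10, eps = 0.1, delta = 0.1.
print(compute_sample_size(10, 0.1, 0.1))    # 2461
# Halving eps roughly quadruples the sample size.
print(compute_sample_size(10, 0.05, 0.1))   # 9843
```

The sample size depends on the dataset only through $d$, not through $\left|\mathcal{D}\right|$, which is why sampling pays off on large datasets.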

#### 3.2. Compute an Upper Bound to the Max Deviation for the True Frequent Sequential Patterns

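**Theorem 4.** Let $\mathcal{D}$ be a finite bag of $\left|\mathcal{D}\right|$ i.i.d. samples from an unknown probability distribution $\pi $ on $\mathbb{U}$ and $\delta \in (0,1)$. Let $d$ be the s-bound of $\mathcal{D}$. If

$${\mu}_{VC}=\sqrt{\frac{2}{\left|\mathcal{D}\right|}\left(d+\ln \frac{1}{\delta }\right)},$$

then, with probability $\ge 1-\delta $, the maximum deviation is at most ${\mu}_{VC}/2$.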

Algorithm 3: ComputeMaxDevVC($\mathcal{D},\delta $): computation of an upper bound on the max deviation for the true frequent sequential pattern mining problem.

Data: Dataset $\mathcal{D}$; $\delta \in (0,1)$.

Result: Upper bound to the max deviation ${\mu}_{VC}/2$.

- $d\leftarrow $ SBoundUpp($\mathcal{D}$);
- ${\mu}_{VC}\leftarrow \sqrt{\frac{2}{\left|\mathcal{D}\right|}\left(d+\ln (1/\delta )\right)}$;
- return ${\mu}_{VC}/2$;
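Algorithm 3 inverts the sample-size formula of Algorithm 2: instead of fixing the accuracy and solving for the number of transactions, it fixes the number of transactions $\left|\mathcal{D}\right|$ and solves for the accuracy. The Python sketch below illustrates this relation. As before, $d$ is passed in directly, and all numeric values are hypothetical.

```python
import math


def compute_max_dev_vc(n, d, delta):
    """Return mu_VC / 2 for a dataset of n transactions.

    d is an upper bound on the s-bound of the dataset (output of SBoundUpp).
    """
    if n <= 0 or not 0 < delta < 1:
        raise ValueError("need n > 0 and delta in (0, 1)")
    mu_vc = math.sqrt(2.0 / n * (d + math.log(1.0 / delta)))
    return mu_vc / 2.0


d, delta, eps = 10, 0.1, 0.1                      # hypothetical values
n = 2.0 / eps**2 * (d + math.log(1.0 / delta))    # size given by the formula of Algorithm 2
print(compute_max_dev_vc(n, d, delta))            # close to eps / 2 = 0.05
print(compute_max_dev_vc(4 * n, d, delta))        # close to eps / 4 = 0.025
```

Quadrupling the number of transactions only halves the bound, which reflects the $1/\sqrt{\left|\mathcal{D}\right|}$ dependence in the formula.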

## 4. Rademacher Complexity of Sequential Patterns

**Theorem 5.**

#### 4.1. An Efficiently Computable Upper Bound to the Rademacher Complexity of Sequential Patterns

**Theorem 6.**

**Theorem 7.**

**Lemma 1.** Consider a subset $W$ of the dataset $\mathcal{D}$, $W\subseteq \mathcal{D}$. Let $C{S}_{W}\left(\mathcal{D}\right)$ be the set of closed sequential patterns in $\mathcal{D}$ whose support set in $\mathcal{D}$ is $W$, that is, $C{S}_{W}\left(\mathcal{D}\right)=\{p\in CS\left(\mathcal{D}\right):{T}_{\mathcal{D}}\left(p\right)=W\}$, with $C=|C{S}_{W}\left(\mathcal{D}\right)|$. Then the number $C$ of closed sequential patterns in $\mathcal{D}$ with $W$ as support set satisfies $0\le C\le |CS\left(\mathcal{D}\right)|$.

**Lemma 2.** ${V}_{\mathcal{D}}=\{{v}_{\mathcal{D}}\left(p\right):\ p\in CS\left(\mathcal{D}\right)\}\cup \left\{(0,\cdots ,0)\right\}$ and $|{V}_{\mathcal{D}}|\le |CS\left(\mathcal{D}\right)|+1$, that is, each vector of ${V}_{\mathcal{D}}$ different from $(0,\cdots ,0)$ is associated with at least one closed sequential pattern in $\mathcal{D}$.

**Lemma 3.** We have

**Lemma 4.** Given an item $a$ in $\mathcal{I}$, we define the following quantity:

Algorithm 4: RadeBound($\mathcal{D}$): algorithm for bounding the empirical Rademacher complexity of sequential patterns.

#### 4.2. Approximating the Rademacher Complexity of Sequential Patterns

Algorithm 5: RadeApprox($\mathcal{D},\kappa $): algorithm for approximating the Rademacher complexity of sequential patterns.

## 5. Sampling-Based Algorithm for Frequent Sequential Pattern Mining

**Theorem 8.**

**Proof.**

**Theorem 9.**

**Proof.**

Algorithm 6: Sampling-Based Algorithm for Frequent Sequential Pattern Mining.

Data: Dataset $\mathcal{D}$; $\epsilon ,\delta \in (0,1)$; $\theta \in (0,1]$.

Result: Set $\mathcal{C}$ that is an $\epsilon $-approximation (resp. a FPF $\epsilon $-approximation) to $FSP(\mathcal{D},\theta )$ with probability $\ge 1-\delta $.

- $m\leftarrow $ ComputeSampleSize($\mathcal{D},\epsilon ,\delta $);
- $S\leftarrow $ sample of $m$ transactions taken independently at random with replacement from $\mathcal{D}$;
- $\mathcal{C}\leftarrow FSP(S,\theta -\epsilon /2)$ (resp. $\theta +\epsilon /2$ to obtain a FPF $\epsilon $-approximation);
- return $\mathcal{C}$;
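The control flow of Algorithm 6 can be sketched in Python as follows. The sample size `m` is passed in as an argument (it would come from ComputeSampleSize), and the miner is an arbitrary callable `mine_fsp(transactions, threshold)` that returns a dictionary from patterns to their frequencies in the given transactions. Both the callable interface and the toy miner `mine_single_items`, which only considers patterns made of a single item, are assumptions introduced for illustration; a real run would plug in an actual sequential pattern miner.

```python
import random
from collections import Counter


def sample_and_mine(dataset, theta, eps, m, mine_fsp, fpf=False, rng=None):
    """Mine a random sample of m transactions at a corrected threshold.

    fpf=False lowers the threshold to theta - eps/2 (eps-approximation);
    fpf=True raises it to theta + eps/2 (FPF eps-approximation).
    """
    rng = rng or random.Random()
    sample = rng.choices(dataset, k=m)  # independent draws with replacement
    threshold = theta + eps / 2 if fpf else theta - eps / 2
    return mine_fsp(sample, threshold)


def mine_single_items(transactions, threshold):
    """Toy stand-in for a real miner: frequent patterns of one item only."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items() if c / n >= threshold}


# Hypothetical dataset: "a" has frequency 0.9, "b" 1.0, "c" 0.05.
data = [["a", "b"]] * 90 + [["b", "c"]] * 5 + [["b"]] * 5
result = sample_and_mine(data, theta=0.5, eps=0.2, m=500,
                         mine_fsp=mine_single_items, rng=random.Random(0))
print(sorted(result))
```

Passing the miner as a parameter keeps the sampling logic independent of the mining algorithm, which matches the way Algorithm 6 treats $FSP(S,\cdot )$ as a black box.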

## 6. Algorithms for True Frequent Sequential Pattern Mining

**Theorem 10.**

**Proof.**

**Theorem 11.**

**Proof.**

Algorithm 7: Mining the True Frequent Sequential Patterns.

Data: Dataset $\mathcal{D}$; $\delta \in (0,1)$; $\theta \in (0,1]$.

Result: Set $\mathcal{G}$ that is a FPF $\mu $-approximation (resp. $\mu $-approximation) to $TFSP(\pi ,\theta )$ with probability $\ge 1-\delta $.

- $\mu /2\leftarrow $ ComputeMaxDeviationBound($\mathcal{D},\delta $);
- $\mathcal{G}\leftarrow FSP(\mathcal{D},\theta +\mu /2)$ (resp. $\theta -\mu /2$ to obtain a $\mu $-approximation);
- return $\mathcal{G}$;
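As a worked illustration of the threshold correction in Algorithm 7, the Python sketch below computes both corrected thresholds for the BIBLE ground truth with $\theta =0.1$, using the average bounds reported in Table 4 (${\mu}_{VC}/2=0.0339$ and ${\mu}_{R}^{a}/2=0.0207$). The function name is hypothetical.

```python
def corrected_thresholds(theta, half_mu):
    """Return (theta + mu/2, theta - mu/2).

    Mining at the first threshold targets an output with no false positives;
    mining at the second targets an output with no false negatives.
    """
    return theta + half_mu, theta - half_mu


theta = 0.1
bounds = [("VC-dimension", 0.0339), ("Rademacher approximation", 0.0207)]
for name, half_mu in bounds:
    upper, lower = corrected_thresholds(theta, half_mu)
    print(f"{name}: mine at {upper:.4f} (FPF) or at {lower:.4f} (no false negatives)")
```

A smaller bound moves both thresholds closer to $\theta $, so the output mined with the Rademacher-based bound is closer to the set of true frequent sequential patterns.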

## 7. Experimental Evaluation

- Assess the performance of our sampling algorithm. In particular, we assess whether, with probability $1-\delta $, the sets of frequent sequential patterns extracted from samples are $\epsilon $-approximations, for the first strategy, and FPF $\epsilon $-approximations, for the second one, of $FSP(\mathcal{D},\theta )$. In addition, we compare the execution time of the sampling algorithm with the time required to mine the full datasets.
- Assess the performance of our algorithms for mining the true frequent sequential patterns. In particular, we assess whether, with probability $1-\delta $, the set of frequent sequential patterns extracted from the dataset with the corrected threshold does not contain false positives, that is, it is a FPF $\mu $-approximation of $TFSP(\pi ,\theta )$, for the first method, and contains all the TFSPs, that is, it is a $\mu $-approximation of $TFSP(\pi ,\theta )$, for the second method. In addition, we compare the results obtained with the VC-dimension and with the Rademacher complexity, both used to compute an upper bound on the maximum deviation.

#### 7.1. Implementation and Environment

#### 7.2. Datasets

- BIBLE: a conversion of the Bible into a sequence dataset where each word is an item;
- BMS1: contains sequences of click-stream data from the e-commerce website Gazelle;
- BMS2: contains sequences of click-stream data from the e-commerce website Gazelle;
- FIFA: contains sequences of click-stream data from the website of FIFA World Cup 98;
- KOSARAK: contains sequences of click-stream data from a Hungarian news portal;
- LEVIATHAN: a conversion of the novel Leviathan by Thomas Hobbes (1651) into a sequence dataset where each word is an item;
- MSNBC: contains sequences of click-stream data from the MSNBC website, where each item represents the category of a web page;
- SIGN: contains sign language utterances.

#### 7.2.1. FSP Mining

#### 7.2.2. TFSP Mining

#### 7.3. Sampling Algorithm Results

#### 7.4. True Frequent Sequential Patterns Results

## 8. Discussion

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A. Missing Proofs

**Theorem 3.**

**Proof.**

**Theorem 4.**

**Proof.**

**Lemma 1.**

**Proof.**

**Lemma 2.**

**Proof.**

**Lemma 3.**

**Proof.**

**Lemma 4.**

**Proof.**

## References

1. Agrawal, R.; Srikant, R. Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering, Taipei, China, 6–10 March 1995; pp. 3–14.
2. Vapnik, V.N.; Chervonenkis, A.Y. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. In Measures of Complexity; Vovk, V., Papadopoulos, H., Gammerman, A., Eds.; Springer: Cham, Switzerland, 2015.
3. Boucheron, S.; Bousquet, O.; Lugosi, G. Theory of classification: A survey of some recent advances. ESAIM Probab. Stat. **2005**, 9, 323–375.
4. Riondato, M.; Upfal, E. Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. ACM Trans. Knowl. Discov. D **2014**, 8, 20.
5. Riondato, M.; Upfal, E. Mining frequent itemsets through progressive sampling with rademacher averages. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 22–27 August 2015; pp. 1005–1014.
6. Raïssi, C.; Poncelet, P. Sampling for sequential pattern mining: From static databases to data streams. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, 28–31 October 2007; pp. 631–636.
7. Riondato, M.; Vandin, F. Finding the true frequent itemsets. In Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, PA, USA, 28 April 2014; pp. 497–505.
8. Servan-Schreiber, S.; Riondato, M.; Zgraggen, E. ProSecCo: Progressive sequence mining with convergence guarantees. Knowl. Inf. Syst. **2020**, 62, 1313–1340.
9. Srikant, R.; Agrawal, R. Mining sequential patterns: Generalizations and performance improvements. In Advances in Database Technology–EDBT ’96, Proceedings of the International Conference on Extending Database Technology, Avignon, France, 25–29 March 1996; Springer: Berlin/Heidelberg, Germany, 1996; pp. 1–17.
10. Pei, J.; Han, J.; Mortazavi-Asl, B.; Wang, J.; Pinto, H.; Chen, Q.; Dayal, U.; Hsu, M.C. Mining sequential patterns by pattern-growth: The prefixspan approach. IEEE Trans. Knowl. Data Eng. **2004**, 16, 1424–1440.
11. Wang, J.; Han, J.; Li, C. Frequent closed sequence mining without candidate maintenance. IEEE Trans. Knowl. Data Eng. **2007**, 19, 1042–1056.
12. Pellegrina, L.; Pizzi, C.; Vandin, F. Fast Approximation of Frequent k-mers and Applications to Metagenomics. J. Comput. Biol. **2019**, 27, 534–549.
13. Riondato, M.; Vandin, F. MiSoSouP: Mining interesting subgroups with sampling and pseudodimension. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19 July 2018; pp. 2130–2139.
14. Al Hasan, M.; Chaoji, V.; Salem, S.; Besson, J.; Zaki, M.J. Origami: Mining representative orthogonal graph patterns. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, 28–31 October 2007; pp. 153–162.
15. Corizzo, R.; Pio, G.; Ceci, M.; Malerba, D. DENCAST: distributed density-based clustering for multi-target regression. J. Big Data **2019**, 6, 43.
16. Cheng, J.; Fu, A.W.c.; Liu, J. K-isomorphism: privacy preserving network publication against structural attacks. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, Indiana, 6–11 June 2010; pp. 459–470.
17. Riondato, M.; Upfal, E. ABRA: Approximating betweenness centrality in static and dynamic graphs with rademacher averages. ACM Trans. Knowl. Discov. D **2018**, 12, 1–38.
18. Mendes, L.F.; Ding, B.; Han, J. Stream sequential pattern mining with precise error bounds. In Proceedings of the Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 941–946.
19. Pellegrina, L.; Riondato, M.; Vandin, F. SPuManTE: Significant Pattern Mining with Unconditional Testing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1528–1538.
20. Gwadera, R.; Crestani, F. Ranking Sequential Patterns with Respect to Significance. In Advances in Knowledge Discovery and Data Mining; Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V., Eds.; Springer: Berlin, Germany, 2010; Volume 6118.
21. Low-Kam, C.; Raïssi, C.; Kaytoue, M.; Pei, J. Mining statistically significant sequential patterns. In Proceedings of the IEEE 13th International Conference on Data Mining, Dallas, TX, USA, 7–10 December 2013; pp. 488–497.
22. Tonon, A.; Vandin, F. Permutation Strategies for Mining Significant Sequential Patterns. In Proceedings of the IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 1330–1335.
23. Mitzenmacher, M.; Upfal, E. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis; Cambridge University Press: New York, NY, USA, 2017.
24. Löffler, M.; Phillips, J.M. Shape fitting on point sets with probability distributions. In Algorithms–ESA 2009, Proceedings of the European Symposium on Algorithms, Copenhagen, Denmark, 7–9 September 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 313–324.
25. Li, Y.; Long, P.M.; Srinivasan, A. Improved bounds on the sample complexity of learning. J. Comput. Syst. Sci. **2001**, 62, 516–527.
26. Shalev-Shwartz, S.; Ben-David, S. Understanding machine learning: From theory to algorithms; Cambridge University Press: New York, NY, USA, 2014.
27. Egho, E.; Raïssi, C.; Calders, T.; Jay, N.; Napoli, A. On measuring similarity for sequences of itemsets. Data Min. Knowl. Discov. **2015**, 29, 732–764.
28. Fournier-Viger, P.; Lin, J.C.W.; Gomariz, A.; Gueniche, T.; Soltani, A.; Deng, Z.; Lam, H.T. The SPMF open-source data mining library version 2. In Machine Learning and Knowledge Discovery in Databases; Berendt, B., Ed.; Springer: Cham, Switzerland, 2016; Volume 9853, pp. 36–40.
29. Johnson, S.G. The NLopt Nonlinear-Optimization Package. 2014. Available online: https://nlopt.readthedocs.io/en/latest/ (accessed on 10 April 2020).
30. GitHub. VCRadSPM: Mining Sequential Patterns with VC-Dimension and Rademacher Complexity. Available online: https://github.com/VandinLab/VCRadSPM (accessed on 10 April 2020).
31. SPMF Datasets. Available online: https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php (accessed on 10 April 2020).

**Figure 1.** Graphical representation of the case $|C{S}_{W}\left(\mathcal{D}\right)|=2$. Sequences ${x}_{1}$ and ${x}_{2}$ are closed sequences in $\mathcal{D}$ with the same support set $W$.

**Figure 3.** Execution time of the sampling algorithm. The execution time required to mine the whole dataset, and the execution times of the sampling algorithm to obtain an $\epsilon $-approximation and a false positives free (FPF) $\epsilon $-approximation are reported. For the sampling algorithms, we show the execution time to compute the sample size, the execution time to generate the sample, and the execution time to mine the sample.

**Table 1.** Datasets characteristics. For each dataset $\mathcal{D}$, we report the number $\left|\mathcal{D}\right|$ of transactions, the total number $\left|\mathcal{I}\right|$ of items, the average transaction item-length and the maximum transaction item-length.

Dataset $\mathcal{D}$ | Size $\lvert \mathcal{D}\rvert $ | $\lvert \mathcal{I}\rvert $ | Avg. Item-Length | Max. Item-Length |
---|---|---|---|---|
BIBLE | 36,369 | 13,905 | 21.6 | 100 |
BMS1 | 59,601 | 497 | 2.5 | 267 |
BMS2 | 77,512 | 3340 | 4.6 | 161 |
FIFA | 20,450 | 2990 | 36.2 | 100 |
KOSARAK | 69,999 | 14,804 | 8.0 | 796 |
LEVIATHAN | 5835 | 9025 | 33.8 | 100 |
MSNBC | 989,818 | 17 | 4.8 | 14,795 |
SIGN | 730 | 267 | 52.0 | 94 |

**Table 2.** Sampling algorithms results. For each enlarged dataset ${\mathcal{D}}_{L}$, we report $\theta $, the ratio $\left|S\right|/|{\mathcal{D}}_{L}|$ between the sample size $\left|S\right|$ and the size of the enlarged dataset $|{\mathcal{D}}_{L}|$, Max_Abs_Err, the maximum $\max_{p\in {C}_{i}}|{f}_{\mathcal{D}}\left(p\right)-{f}_{{S}_{i}}\left(p\right)|$, and Avg_Abs_Err, the average $\max_{p\in {C}_{i}}|{f}_{\mathcal{D}}\left(p\right)-{f}_{{S}_{i}}\left(p\right)|$, over the 5 samples ${S}_{i}$ and with ${C}_{i}$ the set of frequent sequential patterns extracted from ${S}_{i}$, the percentage of $\epsilon $-approximations obtained over the 5 samples and the percentage of FPF $\epsilon $-approximations obtained over the 5 samples.

Dataset ${\mathcal{D}}_{L}$ | $\theta $ | $\lvert S\rvert /\lvert {\mathcal{D}}_{L}\rvert $ | Max_Abs_Err ($\times {10}^{-4}$) | Avg_Abs_Err ($\times {10}^{-4}$) | $\epsilon $-approx (%) | FPF $\epsilon $-approx (%) |
---|---|---|---|---|---|---|
BIBLE | 0.1 | 0.24 | 9.33 | 7.47 | 100 | 100 |
BMS1 | 0.012 | 0.17 | 5.45 | 4.70 | 100 | 100 |
BMS2 | 0.012 | 0.16 | 4.08 | 3.14 | 100 | 100 |
FIFA | 0.25 | 0.50 | 8.68 | 7.07 | 100 | 100 |
KOSARAK | 0.02 | 0.52 | 7.18 | 4.95 | 100 | 100 |
LEVIATHAN | 0.15 | 0.30 | 9.19 | 7.84 | 100 | 100 |
MSNBC | 0.02 | 0.37 | 4.33 | 3.63 | 100 | 100 |
SIGN | 0.4 | 0.20 | 14.14 | 12.19 | 100 | 100 |
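The two error columns of Table 2 aggregate one number per sample: the largest absolute frequency error over the patterns extracted from that sample. A minimal Python sketch of this aggregation follows. It assumes that each sample's output is a non-empty dictionary from patterns to sample frequencies and that the exact frequency of every reported pattern is known. The function name and the numbers are hypothetical and are not taken from the experiments.

```python
def abs_err_stats(exact, sample_outputs):
    """Return (Max_Abs_Err, Avg_Abs_Err) over a list of sample outputs.

    For each output C_i, the per-sample error is max_p |f_D(p) - f_Si(p)|;
    the two statistics are the maximum and the mean of these values.
    """
    per_sample = [
        max(abs(exact[p] - f_s) for p, f_s in output.items())
        for output in sample_outputs
    ]
    return max(per_sample), sum(per_sample) / len(per_sample)


# Hypothetical exact frequencies and the outputs of two samples.
exact = {"a": 0.50, "b": 0.30}
outputs = [
    {"a": 0.51, "b": 0.29},   # per-sample error 0.01
    {"a": 0.50, "b": 0.33},   # per-sample error 0.03
]
max_err, avg_err = abs_err_stats(exact, outputs)
print(round(max_err, 4), round(avg_err, 4))   # 0.03 0.02
```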

**Table 3.** Average fraction of times that $FSP({\mathcal{D}}_{i},\theta )$, with ${\mathcal{D}}_{i}$ a pseudo-artificial dataset, contains false positives, Times FPs, and misses true frequent sequential patterns (TFSPs) (false negatives), Times FNs, over 4 datasets ${\mathcal{D}}_{i}$ from the same ground truth.

Ground Truth | $\theta $ | $\lvert TFSP\rvert $ | Times FPs | Times FNs |
---|---|---|---|---|
BIBLE | 0.1 | 174 | 50% | 100% |
BIBLE | 0.05 | 774 | 100% | 100% |
BMS1 | 0.025 | 13 | 50% | 0% |
BMS1 | 0.0225 | 17 | 0% | 25% |
BMS2 | 0.025 | 10 | 0% | 0% |
BMS2 | 0.0225 | 11 | 0% | 0% |
KOSARAK | 0.06 | 23 | 100% | 0% |
KOSARAK | 0.04 | 41 | 50% | 25% |
LEVIATHAN | 0.15 | 225 | 75% | 100% |
LEVIATHAN | 0.1 | 651 | 100% | 100% |
MSNBC | 0.02 | 97 | 75% | 25% |
MSNBC | 0.015 | 143 | 100% | 50% |

**Table 4.** Comparison of the upper bound $\mu /2$ to the maximum deviation achieved respectively by ComputeMaxDevVC, ComputeMaxDevRadeBound, and ComputeMaxDevRadeApprox for each dataset. We show averages $avg$, maximum values $max$, and standard deviations $std$ for each dataset and method over the 4 pseudo-artificial datasets.

Dataset | ${\mu}_{VC}/2$ avg | ${\mu}_{VC}/2$ max | ${\mu}_{VC}/2$ std ($\times {10}^{-3}$) | ${\mu}_{R}^{b}/2$ avg | ${\mu}_{R}^{b}/2$ max | ${\mu}_{R}^{b}/2$ std ($\times {10}^{-3}$) | ${\mu}_{R}^{a}/2$ avg | ${\mu}_{R}^{a}/2$ max | ${\mu}_{R}^{a}/2$ std ($\times {10}^{-3}$) |
---|---|---|---|---|---|---|---|---|---|
BIBLE | 0.0339 | 0.0340 | 0.1 | 0.0747 | 0.0748 | 0.1 | 0.0207 | 0.0223 | 1.5 |
BMS1 | 0.0194 | 0.0197 | 0.3 | 0.0287 | 0.0294 | 0.6 | 0.0136 | 0.0153 | 1.0 |
BMS2 | 0.0194 | 0.0196 | 0.1 | 0.0202 | 0.0207 | 0.5 | 0.0107 | 0.0115 | 0.5 |
KOSARAK | 0.0334 | 0.0335 | 0.1 | 0.0957 | 0.0972 | 1.5 | 0.0145 | 0.0164 | 1.5 |
LEVIATHAN | 0.0847 | 0.0850 | 0.3 | 0.1878 | 0.1904 | 1.6 | 0.0569 | 0.0636 | 5.5 |
MSNBC | 0.0089 | 0.0090 | 0.1 | 0.0252 | 0.0257 | 0.9 | 0.0035 | 0.0041 | 0.4 |

**Table 5.** Results of our algorithm for the TFSPs with guarantees on the false positives in 4 pseudo-artificial datasets ${\mathcal{D}}_{i}$ for each ground truth. The table reports the frequency thresholds $\theta $ used in the experiments, the number of TFSPs in the ground truth, the number of times the output contains false positives using ${\widehat{\theta}}_{VC}=\theta +{\mu}_{VC}/2$ as frequency threshold and the average fraction of the reported TFSPs in the output using such frequency threshold, the number of times the output contains false positives using ${\widehat{\theta}}_{R}=\theta +{\mu}_{R}^{a}/2$ and the average fraction of the reported TFSPs in the output using such frequency threshold.

Ground Truth | $\theta $ | $\lvert TFSP\rvert $ | Times FPs in $FSP({\mathcal{D}}_{i},{\widehat{\theta}}_{VC})$ | $\lvert FSP({\mathcal{D}}_{i},{\widehat{\theta}}_{VC})\rvert /\lvert TFSP\rvert $ | Times FPs in $FSP({\mathcal{D}}_{i},{\widehat{\theta}}_{R})$ | $\lvert FSP({\mathcal{D}}_{i},{\widehat{\theta}}_{R})\rvert /\lvert TFSP\rvert $ |
---|---|---|---|---|---|---|
BIBLE | 0.1 | 174 | 0% | 0.55 | 0% | 0.68 |
BIBLE | 0.05 | 774 | 0% | 0.32 | 0% | 0.47 |
BMS1 | 0.025 | 13 | 0% | 0.38 | 0% | 0.48 |
BMS1 | 0.0225 | 17 | 0% | 0.29 | 0% | 0.43 |
BMS2 | 0.025 | 10 | 0% | 0.13 | 0% | 0.20 |
BMS2 | 0.0225 | 11 | 0% | 0.18 | 0% | 0.18 |
KOSARAK | 0.06 | 23 | 0% | 0.41 | 0% | 0.73 |
KOSARAK | 0.04 | 41 | 0% | 0.43 | 0% | 0.74 |
LEVIATHAN | 0.15 | 225 | 0% | 0.30 | 0% | 0.41 |
LEVIATHAN | 0.1 | 651 | 0% | 0.18 | 0% | 0.30 |
MSNBC | 0.02 | 97 | 0% | 0.56 | 0% | 0.77 |
MSNBC | 0.015 | 143 | 0% | 0.50 | 0% | 0.76 |

**Table 6.** Results of our algorithm for the TFSPs with guarantees on the false negatives in 4 pseudo-artificial datasets ${\mathcal{D}}_{i}$ for each ground truth. The table reports the frequency thresholds $\theta $ used in the experiments, the number of TFSPs in the ground truth, the number of times the output of the algorithm misses some TFSPs using ${\tilde{\theta}}_{VC}=\theta -{\mu}_{VC}/2$ as frequency threshold and the average fraction of sequential patterns that are TFSPs in the output using such frequency threshold, the number of times the output of the algorithm misses some TFSPs using ${\tilde{\theta}}_{R}=\theta -{\mu}_{R}^{a}/2$ and the average fraction of sequential patterns that are TFSPs in the output using such frequency threshold.

Ground Truth | $\theta $ | $\lvert TFSP\rvert $ | Times FNs in $FSP({\mathcal{D}}_{i},{\tilde{\theta}}_{VC})$ | $\lvert TFSP\rvert /\lvert FSP({\mathcal{D}}_{i},{\tilde{\theta}}_{VC})\rvert $ | Times FNs in $FSP({\mathcal{D}}_{i},{\tilde{\theta}}_{R})$ | $\lvert TFSP\rvert /\lvert FSP({\mathcal{D}}_{i},{\tilde{\theta}}_{R})\rvert $ |
---|---|---|---|---|---|---|
BIBLE | 0.1 | 174 | 0% | 0.42 | 0% | 0.63 |
BIBLE | 0.05 | 774 | 0% | 0.09 | 0% | 0.33 |
BMS1 | 0.025 | 13 | 0% | 0.07 | 0% | 0.21 |
BMS1 | 0.0225 | 17 | 0% | 0.04 | 0% | 0.19 |
BMS2 | 0.025 | 10 | 0% | 0.03 | 0% | 0.32 |
BMS2 | 0.0225 | 11 | 0% | 0.01 | 0% | 0.19 |
KOSARAK | 0.06 | 23 | 0% | 0.30 | 0% | 0.64 |
KOSARAK | 0.04 | 41 | 0% | 0.04 | 0% | 0.49 |
LEVIATHAN | 0.15 | 225 | 0% | 0.12 | 0% | 0.30 |
LEVIATHAN | 0.1 | 651 | 0% | 0.01 | 0% | 0.13 |
MSNBC | 0.02 | 97 | 0% | 0.42 | 0% | 0.77 |
MSNBC | 0.015 | 143 | 0% | 0.24 | 0% | 0.65 |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Santoro, D.; Tonon, A.; Vandin, F. Mining Sequential Patterns with VC-Dimension and Rademacher Complexity. *Algorithms* **2020**, *13*, 123.
https://doi.org/10.3390/a13050123
