Article

An Efficient Algorithm for Mining Stable Periodic High-Utility Sequential Patterns

Department of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
*
Author to whom correspondence should be addressed.
Symmetry 2022, 14(10), 2032; https://doi.org/10.3390/sym14102032
Submission received: 30 August 2022 / Revised: 20 September 2022 / Accepted: 22 September 2022 / Published: 28 September 2022
(This article belongs to the Section Computer)

Abstract

Periodic high-utility sequential pattern mining (PHUSPM) extracts periodically occurring high-utility sequential patterns (HUSPs) from a quantitative sequence database according to a user-specified minimum utility threshold (minutil). A sequential pattern is considered periodic if none of its periods (the time between two consecutive occurrences of the pattern) exceeds a user-specified maximum periodicity threshold (maxPer). However, because this criterion is strict, the traditional PHUSPM method discards some useful sequential patterns, and the periods of some of the patterns it keeps fluctuate greatly (i.e., they are unstable). In frequent itemset mining (FIM), researchers have put forward strategies to solve these problems. Because of the symmetry of frequent itemset patterns (FIPs), these strategies cannot be directly applied to PHUSPM. To address these issues, this work proposes the stable periodic high-utility sequential pattern mining (SPHUSPM) algorithm. The contributions of this paper are as follows. First, we introduce the concept of stability to overcome the abovementioned problems, mine sequential patterns with stable periodic behavior, and propose the concept of stable periodic high-utility sequential patterns (SPHUSPs) for the first time. Second, we design a new data structure named the PUL-list to record the periodic information of sequential patterns, thereby improving the mining efficiency. Third, we propose the maximum lability pruning strategy for sequential patterns (MLPS), which can prune a large number of unstable sequential patterns in advance. To assess the algorithm’s effectiveness, we perform extensive experiments. The results show that the algorithm not only mines patterns that are ignored by traditional algorithms, but also ensures that the discovered patterns have stable periodic behavior. In addition, with the MLPS pruning strategy, the algorithm prunes 46.5% of candidates in advance on average across six datasets. Pruning a large number of candidates in advance not only speeds up the mining process, but also greatly reduces memory usage.

1. Introduction

High-utility sequential pattern mining (HUSPM) [1,2,3,4,5,6,7,8,9,10,11] is a significant area of knowledge discovery and data mining that has been the subject of a great deal of research. HUSPM has been applied in many settings, such as mining high-utility sequential patterns (HUSPs) in online dynamic log data [12], mobile commerce data [13], and gene regulation data [14]. Large numbers of HUSPs are mined, but some are redundant in certain scenarios. In marketing, for example, a merchant needs to know which product combinations are both highly profitable and sold regularly. Product combinations that are highly profitable but rarely sold still qualify as HUSPs, and in this case they are redundant. In recent years, some researchers have added time constraints to HUSPM [15,16]. Considering the periodicity of HUSPs in the quantitative sequence database, they proposed periodic high-utility sequential pattern mining (PHUSPM) to mine periodic high-utility sequential patterns (PHUSPs) [15,16]. PHUSPM is also widely used in fields related to pattern and knowledge discovery, such as research on consumer habits, website click-through rate data, financial market analysis, biomedical applications, and mobile computing. PHUSPM defines the interval between occurrences of the same pattern in different sequences as a period. The maximum period of a sequential pattern is generally used to characterize the pattern’s periodicity. If this maximum period does not exceed the user-defined upper limit (maxPer), the pattern is regarded as periodic. However, this criterion is too strict: if a single period of a sequential pattern exceeds the maxPer threshold, the pattern is discarded. For example, in market basket analysis, suppose maxPer is one week and some customers buy eggs and milk every weekend. This pattern is periodic, but it will not be considered periodic if the customers skip a single week. Therefore, traditional PHUSPM discards some useful and interesting PHUSPs. In addition, when maxPer is set too large, the periods of the mined PHUSPs vary widely. Such sequential patterns are unsuitable for most practical applications. To sum up, the traditional PHUSPM method suffers from two problems: some useful sequential patterns are discarded, and some of the discovered sequential patterns have large periodic fluctuations.
These problems also exist in periodic frequent pattern mining (PFPM) [17,18,19]. To provide greater flexibility, Kiran et al. proposed the partial periodic frequent pattern mining (PPFPM) algorithm [20]. This algorithm relaxes the maxPer constraint by allowing a specific number of periods to exceed it. In brief, a pattern is considered periodic if no more than x (user-defined) of its periods exceed the maxPer threshold. PFPM is thus the special case x = 0. Although PPFPM is more adaptable than PFPM, an important problem remains: it only verifies whether each period exceeds the maxPer threshold, and both PFPM and PPFPM ignore the amount by which a period exceeds it. For example, if the maxPer threshold is set to a week, it makes no difference whether customers stop buying some products for two weeks or for a year. Additionally, none of the models discussed above consider how closely spaced the values of the periods meeting the maxPer threshold are. As a result, a pattern can be considered periodic even though its periods often alternate between values greater and smaller than the maxPer threshold. To solve these problems, Fournier-Viger et al. proposed stable periodic frequent pattern mining (SPFPM) [21] and top-k stable periodic frequent pattern mining (TSPIN) [22]. SPFPM and TSPIN introduce the concept of stability into PFPM, mining stable periodic patterns that satisfy periodicity while maintaining similar period lengths and are therefore more predictable than unstable patterns. However, these algorithms are not suitable for PHUSPM, because they do not consider the utility of items and because, unlike itemsets, sequential patterns are not symmetric (the order of itemsets matters). Briefly, an item can have several utility values in a sequence, and a corresponding sequence can therefore also have multiple utility values.
This paper suggests the stable periodic high-utility sequential pattern mining (SPHUSPM) algorithm for mining stable periodic high-utility sequential patterns (SPHUSPs). The SPHUSPM has three important contributions.
  • SPHUSPM provides a new stability measure and makes fuller use of time period information to mine more useful patterns. It offers a new research strategy for the HUSPM field, and combining several measures makes the mined patterns more interesting and better aligned with user requirements. In practical applications, the algorithm considers profit and time period information together, giving decision makers more accurate and efficient support.
  • We design a new data structure named the PUL-list and a maximum lability pruning strategy (MLPS) to increase the effectiveness of mining. Experiments show that these two techniques greatly improve the efficiency of the algorithm.
  • We perform experiments on six different datasets, which show that the algorithm mines the desired SPHUSPs while also performing well in terms of runtime and memory usage.
The remainder of this paper is structured as follows. Section 2 discusses related work. The SPHUSPM’s preliminaries and problem definitions are introduced in Section 3. In Section 4 the suggested SPHUSPM algorithm is described. Comparative experiments are then presented in Section 5. Lastly, the conclusion of the paper is given.

2. Related Work

2.1. High-Utility Sequential Pattern Mining

High-utility itemset mining (HUIM) [23], which takes into account the quantity of items bought and unit profit, aims to find interesting patterns. Although a growing number of researchers have proposed several HUIM algorithms [24,25,26,27,28,29], they cannot be directly applied to the HUSPM because they did not consider the order of the itemsets.
HUSPM [2,3,4,5] is a research direction that combines HUIM and SPM, and its goal is to mine HUSPs in quantitative sequence databases. HUSPM was first proposed by Zhou et al., who added the concept of high utility to web log sequence pattern mining [1]. Ahmed et al. proposed a horizontal method called UL and a pattern-growth method called US [6]. Yin et al. proposed the concept of maximum utility and the efficient USpan algorithm [3], which uses a lexicographic sequence tree (LS-tree) structure to store sequence and utility information together with width and depth pruning strategies. On this basis, the top-k strategy was introduced, and the TUS algorithm [7] was proposed to mine HUSPs. The HUS-span algorithm [4], which uses the same LS-tree structure as USpan, was put forth by Wang et al. This algorithm also uses pruning strategies such as the prefix extension utility (PEU) strategy and the reduced sequence utility (RSU) strategy. Lan et al. devised the PHUS algorithm [8], which uses a sequential utility table structure to store sequential utility values and employs a projection-based pruning strategy and an indexing strategy to reduce search time. Alkan et al. proposed the HuspExt algorithm [9], which uses a data matrix storage structure and a PBCG pruning strategy. The latter is based on a more compact overestimation strategy named CRoM, which can delete an abundance of unpromising candidates and greatly improves the mining efficiency. Recently, Gan et al. devised ProUM [10], a projection-based mining algorithm. To increase mining speed, the algorithm makes use of the utility array structure and the sequence extended utility (SEU) pruning strategy. On this basis, Gan et al. devised the HUSP-ULL algorithm [11], which uses a data structure called the utility-linked (UL)-list to efficiently record utility and location. To obtain tight upper bounds on the utility of candidate sequences, the algorithm also introduces two pruning methods, the irrelevant item pruning strategy (IPS) and the look-ahead strategy (LAS).
Although the HUSPM could discover a large number of HUSPs, in some application scenarios, some of the sequential patterns were redundant and useless. The traditional HUSPM method ignored the time constraint problem, so some researchers proposed the PHUSPM algorithm. The next section will review the related work of PHUSPM.

2.2. Periodic High-Utility Sequential Pattern Mining

Some researchers have developed algorithms to mine periodic frequent patterns (PFPs) in transaction databases in the area of frequent pattern mining (FPM) [20,30,31,32,33]. Most of these algorithms rely on tree-based data structures to produce the entire collection of periodic frequent patterns in a transactional database. Researchers have also proposed algorithms to mine periodic frequent sequential patterns (PFSPs) in sequence databases [34,35]. Because both types of algorithms use the support of patterns to mine them and ignore the utility of the items, they cannot find patterns that are both high-utility and periodic.
In the field of HUIM, researchers have devised methods for mining periodic high-utility itemsets (PHUIs) in quantitative transaction databases [36,37,38]. PHM [36] is an approach that Fournier-Viger et al. suggested for mining PHUIs in quantitative transaction datasets. The algorithm created a new class of patterns, called periodic high-utility itemsets, by fusing the ideas of periodic itemsets and high-utility itemsets. They proposed the minimum periodicity and the average periodicity as two new measures to assess periodic behavior more precisely. For finding short-period high-utility itemsets (SPHUIs) [37,38] in quantitative transaction databases, Lin et al. suggested two techniques.
In the field of HUSPM, there are currently only two algorithms for mining PHUSPs in quantitative sequence databases. Dinh et al. proposed an algorithm named PHUSPM [15], adding periodicity to HUSPM for the first time. However, this algorithm has no dedicated data structure or pruning strategy, so it is not efficient. After that, Dinh et al. [16] suggested the PUSOM algorithm based on the original algorithm, which introduced a data structure called PUSP and used the maximum periodic pruning (MPP) strategy. However, the maxPer threshold used by this algorithm is too strict: some useful sequential patterns are discarded, and some of the discovered patterns have large periodic fluctuations (i.e., they are unstable). These problems also exist in PFPM [18,19,20], and some researchers have proposed the concept of stability to solve them. The following section reviews related work on SPFPM.

2.3. Stable Periodic Frequent Pattern Mining

Most research on PFP mining has evaluated the periodic behavior of patterns by comparing their periods with a maxPer threshold, but has ignored the extent to which these periods exceed maxPer [18,19,20,36]. To find patterns with stable periodic behavior, Fournier-Viger et al. [21] proposed mining a novel class of periodic frequent patterns in transaction databases, named stable periodic frequent patterns (SPFPs). The algorithm was called stable periodic frequent pattern mining (SPFPM). On this basis, to address the difficulty of setting the minimum support threshold, Fournier-Viger et al. [22] suggested an algorithm named top-k stable periodic patterns (TSPIN). Although the above algorithms can mine patterns with stable periodic behavior in transaction databases, they cannot be directly applied to HUSPM, because they take into account neither the order of itemsets in a sequence (that is, asymmetry) nor the utility of patterns.
In light of the above, we list the limitations of the previously proposed work.
  • In the PHUSPM algorithm, it is difficult to set the maxPer threshold accurately. Some patterns have a few periodic fluctuations. However, this situation has little impact on the decision, and they are still useful patterns. If maxPer is set too small, these interesting patterns will be ignored. If maxPer is set too large, the mined patterns will have unstable periods.
  • Because the SPFPM algorithm is designed specifically for FPM, it cannot be directly applied to HUSPM. In short, it does not take into account the order between itemsets and the utility values of the items.
To resolve the aforementioned issues, we suggest a new stabilization method to discover stable periodic high-utility sequential patterns (SPHUSPs). We will define the SPHUSPM problem definitions and provide its preliminary information in the next part.

3. Preliminaries and Problem Definitions

To help the reader better understand the topic, we summarize the symbols that appear in the definitions in Table 1.
Let $I = \{i_1, i_2, \ldots, i_M\}$ be a finite set containing $M$ unique items. A q-item is denoted as $(i_k, q_k)$, which represents the item $i_k \in I$ $(1 \le k \le M)$ and its purchase quantity $q_k$ (internal utility). Each item has a weight representing its importance or unit profit, which is called its external utility and is denoted by $p(i_k)$. A q-itemset $X = [(i_1, q_1)(i_2, q_2) \ldots (i_m, q_m)]$ is a set of q-items. Without loss of generality, the q-items in a q-itemset are sorted in lexicographic order ($\prec$). A q-sequence $s = \langle X_1 X_2 \ldots X_n \rangle$ is an ordered list of q-itemsets. A quantitative sequence database $S = \{s_1, s_2, \ldots, s_N\}$ is a set of q-sequences wherein each q-sequence has a unique identifier called $sid$. Table 2 is a quantitative sequence database. The external utility (profit) of each item in $I$ is shown in Table 3. All the examples in this article are from this quantitative sequence database.
Definition 1.
Let $X_a = [(i_{a_1}, q_{a_1})(i_{a_2}, q_{a_2}) \ldots (i_{a_m}, q_{a_m})]$ and $X_b = [(i_{b_1}, q_{b_1})(i_{b_2}, q_{b_2}) \ldots (i_{b_{m'}}, q_{b_{m'}})]$ be two q-itemsets, where $i_{a_k} \in I$ $(1 \le k \le m)$ and $i_{b_k} \in I$ $(1 \le k \le m')$. If there exist positive integers $1 \le j_1 \le j_2 \le \ldots \le j_m \le m'$ such that $i_{a_1} = i_{b_{j_1}} \wedge q_{a_1} = q_{b_{j_1}}$, $i_{a_2} = i_{b_{j_2}} \wedge q_{a_2} = q_{b_{j_2}}$, $\ldots$, $i_{a_m} = i_{b_{j_m}} \wedge q_{a_m} = q_{b_{j_m}}$, then $X_b$ is said to contain $X_a$, which is denoted as $X_a \subseteq X_b$.
For example, the q-itemset $[(b, 3)(d, 2)(e, 2)]$ in q-sequence $s_2$ contains the q-itemsets $(b, 3)$, $(d, 2)$, $(e, 2)$, $[(b, 3)(d, 2)]$, $[(b, 3)(e, 2)]$, $[(d, 2)(e, 2)]$, and $[(b, 3)(d, 2)(e, 2)]$.
Definition 2.
Let $A = \langle A_1 A_2 \ldots A_n \rangle$ and $B = \langle B_1 B_2 \ldots B_{n'} \rangle$ $(n \le n')$ be two q-sequences, where $A_\alpha$, $B_\beta$ are q-itemsets $(1 \le \alpha \le n,\ 1 \le \beta \le n')$. If there exist positive integers $1 \le j_1 \le j_2 \le \ldots \le j_n \le n'$ such that $A_1 \subseteq B_{j_1}$, $A_2 \subseteq B_{j_2}$, $\ldots$, $A_n \subseteq B_{j_n}$, then $A$ is a q-subsequence of $B$ and $B$ is a q-supersequence of $A$, denoted as $A \subseteq B$.
For example, the q-sequences $\langle [(a, 3)(f, 2)], [(a, 5)], [(e, 2)] \rangle$ and $\langle [(a, 3)(b, 1)], [(a, 5)(c, 2)(g, 5)], [(b, 3)(d, 2)] \rangle$ are two q-subsequences of $s_2$.
Definition 3.
The utility of a q-item $(i, q)$ in a q-sequence $s$ is denoted and defined as
$$u(i, q) = p(i) \times q(i).$$
The utility of a q-itemset $X$ in a q-sequence $s$ is denoted and defined as
$$u(X) = \sum_{k=1}^{m} u(i_k, q_k).$$
The utility of a q-sequence $s$ is denoted and defined as
$$u(s) = \sum_{j=1}^{n} u(X_j).$$
For example, the utility of the q-item $c$ in q-sequence $s_4$ (e.g., $c_1$) is $u(c, 3) = 4 \times 3 = 12$. The utility of the q-itemset $[(b, 2)(c, 3)]$ in q-sequence $s_4$ is $u([(b, 2)(c, 3)]) = u(b, 2) + u(c, 3) = 3 \times 2 + 4 \times 3 = 18$. The utility of the q-sequence $s_4$ is $u(s_4) = u([(b, 2)(c, 3)]) + u([(a, 5)(e, 1)]) + u([(b, 4)(d, 3)(e, 5)]) = 18 + 6 + 23 = 47$.
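The three utility levels of Definition 3 can be sketched in a few lines of Python. This is only an illustration: Table 3 is not reproduced here, so the unit profits in `PROFIT` are inferred from the worked examples in the text, and the representation of a q-sequence as a list of dictionaries is an assumption of this sketch.

```python
# External utilities (unit profits). ASSUMPTION: values inferred from the
# worked examples in the text, since Table 3 is not reproduced here.
PROFIT = {"a": 1, "b": 3, "c": 4, "d": 2, "e": 1}

# q-sequence s_4 as given in the example above: a list of q-itemsets,
# each mapping an item to its purchase quantity.
S4 = [{"b": 2, "c": 3}, {"a": 5, "e": 1}, {"b": 4, "d": 3, "e": 5}]


def item_utility(item, quantity, profit=PROFIT):
    """u(i, q): external utility times internal utility."""
    return profit[item] * quantity


def itemset_utility(itemset, profit=PROFIT):
    """u(X): sum of the utilities of the q-items in the q-itemset."""
    return sum(item_utility(i, q, profit) for i, q in itemset.items())


def sequence_utility(qseq, profit=PROFIT):
    """u(s): sum of the utilities of the q-itemsets in the q-sequence."""
    return sum(itemset_utility(x, profit) for x in qseq)


print(item_utility("c", 3))    # 12
print(itemset_utility(S4[0]))  # 18
print(sequence_utility(S4))    # 47
```

The printed values agree with the example for $s_4$ above.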
Definition 4.
Given a q-sequence $s = \langle (i_1, q_1)(i_2, q_2) \ldots (i_n, q_n) \rangle$ and a sequence $t = \langle t_1 t_2 \ldots t_m \rangle$, $s$ is said to match $t$ if $n = m$ and $i_k = t_k$ for $1 \le k \le n$, denoted as $t \sim s$.
For example, $t = \langle (abcf)(acg)(bde) \rangle \sim s_2$.
Definition 5.
The sequence utility of a sequence $t = \langle t_1 t_2 \ldots t_m \rangle$ in a q-sequence $s = \langle X_1 X_2 \ldots X_n \rangle$ is denoted and defined as
$$v(t, s) = \bigcup_{s' \sim t \,\wedge\, s' \subseteq s} u(s').$$
The utility of $t$ in a q-sequence database $S$ is denoted as
$$v(t) = \bigcup_{s \in S} v(t, s).$$
For example, the utility of the sequence $t = \langle g\, b \rangle$ in the q-sequence $s_1$ is calculated as $v(t, s_1) = \{u(\langle (g, 3)(b, 2) \rangle)\} = \{12\}$. The utility of $t$ in the database shown in Table 2 is $v(t) = \{v(t, s_1), v(t, s_2), v(t, s_3), v(t, s_4), v(t, s_5), v(t, s_6)\} = \{12, 19, 16, 16\}$.
Definition 6.
The maximum utility of a sequence $t$ in a q-sequence $s$ is denoted and defined as
$$u_{max}(t, s) = \max\{v(t, s)\}.$$
The maximum utility of a sequence $t$ in a q-sequence database $S$ is denoted and defined as
$$u_{max}(t) = \sum_{s \in S} u_{max}(t, s).$$
For example, the maximum utility of the sequence $t = \langle g\, b \rangle$ in the sequence database $S$ shown in Table 2 is $u_{max}(t) = u_{max}(\langle g\, b \rangle, s_1) + u_{max}(\langle g\, b \rangle, s_2) + u_{max}(\langle g\, b \rangle, s_3) = 12 + 19 + 16 = 47$.
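Definitions 5 and 6 can be illustrated with a brute-force sketch that enumerates every match of a pattern in a q-sequence. Several things here are assumptions of the sketch rather than part of the paper: the q-sequence $s_2$ is pieced together from the examples in the text, the unit profits are inferred from the worked examples (Tables 2 and 3 are not reproduced here), and a q-sequence with no match is taken to contribute 0 to $u_{max}(t)$.

```python
from itertools import combinations

# ASSUMPTION: unit profits inferred from the worked examples in the text.
# The profit of item f cannot be inferred, so patterns containing f are
# not supported by this sketch.
PROFIT = {"a": 1, "b": 3, "c": 4, "d": 2, "e": 1, "g": 2}

# s_2 reconstructed from the examples in the text; s_4 as given in the text.
S2 = [{"a": 3, "b": 1, "c": 3, "f": 2}, {"a": 5, "c": 2, "g": 5},
      {"b": 3, "d": 2, "e": 2}]
S4 = [{"b": 2, "c": 3}, {"a": 5, "e": 1}, {"b": 4, "d": 3, "e": 5}]


def match_utilities(t, qseq, profit=PROFIT):
    """v(t, s): the utility of every match of pattern t in q-sequence qseq.

    t is a list of itemsets (lists of items); a match assigns the itemsets
    of t to strictly increasing itemset positions of qseq.
    """
    utilities = []
    for positions in combinations(range(len(qseq)), len(t)):
        pairs = list(zip(t, positions))
        if all(set(items) <= set(qseq[j]) for items, j in pairs):
            utilities.append(sum(profit[i] * qseq[j][i]
                                 for items, j in pairs for i in items))
    return utilities


def max_utility_in_sequence(t, qseq, profit=PROFIT):
    """u_max(t, s); 0 when t does not occur in qseq (assumption)."""
    utilities = match_utilities(t, qseq, profit)
    return max(utilities) if utilities else 0


def max_utility(t, database, profit=PROFIT):
    """u_max(t): sum of the per-sequence maximum utilities."""
    return sum(max_utility_in_sequence(t, s, profit) for s in database)


print(match_utilities([["g"], ["b"]], S2))  # [19]
print(match_utilities([["b"], ["e"]], S4))  # [7, 11]
```

The first call reproduces the value 19 that the example reports for $s_2$; the second shows a pattern with two matches in $s_4$, of which only the larger utility, 11, counts toward $u_{max}$.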
Definition 7.
Given two q-sequences $s$ and $s'$, if $s \subseteq s'$, the extension of $s$ in $s'$ is said to be the rest of $s'$ after $s$, and is denoted as $\langle s' - s \rangle_{rest}$. Given a sequence $t$ and a q-sequence $s$, if $t \sim s_k \wedge s_k \subseteq s$ $(t \subseteq s)$, the extension of $t$ in $s$ is the rest of $s$ after $s_k$, which is denoted as $\langle s - t \rangle_{rest}$, where $s_k$ is the first match of $t$ in $s$.
For example, consider the sequence $t = \langle [a\, c] \rangle$. There exist two matches of $t$ in $s_2$. The first one is $\langle [(a, 3)(c, 3)] \rangle$. Thus, $\langle s_2 - t \rangle_{rest} = \langle [(f, 2)], [(a, 5)(c, 2)(g, 5)], [(b, 3)(d, 2)(e, 2)] \rangle$.
Definition 8.
The set of extension items of a sequence $t$ in a quantitative sequential database $D$ is denoted as $I(t)_{rest}$ and defined as
$$I(t)_{rest} = \{ i_j \mid i_j \in \langle s - t \rangle_{rest} \wedge t \subseteq s \wedge s \in D \}.$$
For example, $I(\langle [a], [b] \rangle)_{rest} = \{c, d, e\}$.
Definition 9.
The remaining utility of a sequence $t$ in a q-sequence $s$ is denoted as $ru(t, s)$ and defined as
$$ru(t, s) = u(\langle s - t \rangle_{rest})\ (t \subseteq s) = \sum_{i_j \in \langle s - t \rangle_{rest}} u(i_j).$$
For example, given a sequence $t = \langle a\, b \rangle$ and the q-sequence $s_1$ in Table 2, the extension of $t$ in $s_1$ is $\langle s_1 - t \rangle_{rest} = \langle [(e, 1)], [(d, 3)] \rangle$. The remaining utility is $ru(\langle a\, b \rangle, s_1) = u(\langle s_1 - \langle a\, b \rangle \rangle_{rest}) = u(e, 1) + u(d, 3) = 1 + 6 = 7$.
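The extension and remaining utility of Definitions 7 and 9 can be sketched as follows. The sketch assumes that the first match is found greedily (each itemset of the pattern is placed at the earliest possible position) and that the rest starts after the lexicographically last matched item of the final itemset. As before, $s_2$ is reconstructed from the examples in the text and the unit profits are inferred from the worked examples, so these are illustrative values only.

```python
# ASSUMPTION: unit profits inferred from the worked examples in the text.
PROFIT = {"a": 1, "b": 3, "c": 4, "d": 2, "e": 1, "g": 2}

# s_2 reconstructed from the examples in the text; s_4 as given in the text.
S2 = [{"a": 3, "b": 1, "c": 3, "f": 2}, {"a": 5, "c": 2, "g": 5},
      {"b": 3, "d": 2, "e": 2}]
S4 = [{"b": 2, "c": 3}, {"a": 5, "e": 1}, {"b": 4, "d": 3, "e": 5}]


def first_match_end(t, qseq):
    """Return (itemset position, last matched item) of the first match of
    the non-empty pattern t in qseq, or None if t does not occur."""
    j = -1
    for items in t:
        j += 1  # the next itemset of t must lie strictly after the previous one
        while j < len(qseq) and not set(items) <= set(qseq[j]):
            j += 1
        if j == len(qseq):
            return None
    return j, max(items)


def extension(t, qseq):
    """<s - t>_rest: what remains of qseq after the first match of t."""
    end = first_match_end(t, qseq)
    if end is None:
        return []
    j, last_item = end
    # Items of the last matched itemset that come after the last matched item.
    tail = {i: q for i, q in qseq[j].items() if i > last_item}
    return ([tail] if tail else []) + [dict(x) for x in qseq[j + 1:]]


def remaining_utility(t, qseq, profit=PROFIT):
    """ru(t, s): total utility of the q-items in the extension."""
    return sum(profit[i] * q for x in extension(t, qseq) for i, q in x.items())


print(extension([["a", "c"]], S2))
# [{'f': 2}, {'a': 5, 'c': 2, 'g': 5}, {'b': 3, 'd': 2, 'e': 2}]
print(remaining_utility([["b"], ["a"]], S4))  # 24
```

The first call reproduces the extension of $\langle [a\, c] \rangle$ in $s_2$ given in the example of Definition 7. In the second call, the rest of $s_4$ after the first match of $\langle b\, a \rangle$ is $\langle [(e, 1)], [(b, 4)(d, 3)(e, 5)] \rangle$, whose utility is 1 + 23 = 24.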
Definition 10.
A sequence $t$ is said to be a high-utility sequential pattern if $u_{max}(t) \ge minutil$ (or $\xi$), where $minutil$ (or $\xi$) is a user-specified minimum utility threshold.
Definition 11.
Let there be a q-sequence database $S = \{s_1, s_2, \ldots, s_n\}$ and a sequence $t$. The set of q-sequences containing $t$ is denoted as
$$S(t) = \{s_{\alpha_1}, s_{\alpha_2}, \ldots, s_{\alpha_k}\}, \quad 1 \le \alpha_1 < \alpha_2 < \ldots < \alpha_k \le n.$$
For example, Table 4 shows the occurrences of items in the q-sequence database $S$ (Table 2). The sets of q-sequences containing the sequences $\langle a\, b \rangle$ and $\langle (a\, b) \rangle$ are, respectively, $S(\langle a\, b \rangle) = \{s_1, s_2, s_3, s_4, s_5\}$ and $S(\langle (a\, b) \rangle) = \{s_1, s_2, s_3, s_5, s_6\}$.
Definition 12.
Let there be two q-sequences $s_\alpha$, $s_\beta$ and a sequence $t$, such that $t \sim s \wedge s \subseteq s_\alpha \wedge s_\alpha \in S(t)$ and $t \sim s \wedge s \subseteq s_\beta \wedge s_\beta \in S(t)$. $s_\alpha$ and $s_\beta$ are said to be consecutive with respect to $t$ if there is no q-sequence $s_\gamma \in S(t)$ such that $\alpha < \gamma < \beta$.
The period of two consecutive q-sequences $s_\alpha$ and $s_\beta$ is denoted and defined as
$$pe(s_\alpha, s_\beta) = \beta - \alpha.$$
In other words, $pe(s_\alpha, s_\beta)$ is the distance between the positions of $s_\alpha$ and $s_\beta$ in the database.
For example, the sequence $\langle (a\, b) \rangle$ appears in $s_1$, $s_2$, $s_3$, $s_5$ and $s_6$. Hence, $pe(s_1, s_2) = 2 - 1 = 1$, $pe(s_2, s_3) = 3 - 2 = 1$, $pe(s_3, s_5) = 5 - 3 = 2$, and $pe(s_5, s_6) = 6 - 5 = 1$.
Definition 13.
Let there be a sequence $t$ and $S(t) = \{s_{\alpha_1}, s_{\alpha_2}, \ldots, s_{\alpha_k}\}$, where $1 \le \alpha_1 < \alpha_2 < \ldots < \alpha_k \le n$. The periods of a sequence $t$ are a list of periods denoted and defined as
$$pes(t) = \bigcup_{1 \le \rho \le k+1} pe(s_{\alpha_{\rho-1}}, s_{\alpha_\rho}) = \{pes(t, 0), pes(t, 1), \ldots, pes(t, |S(t)|)\}, \quad \alpha_0 = 0,\ \alpha_{k+1} = n.$$
For example, the sequence $\langle a\, b \rangle$ has $pes(\langle a\, b \rangle) = \{1, 1, 1, 1, 1, 1\}$. The sequence $\langle (a\, b) \rangle$ has $pes(\langle (a\, b) \rangle) = \{1, 1, 1, 2, 1, 0\}$.
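The period list of Definition 13 follows directly from the identifiers of the q-sequences that contain the pattern. The short sketch below assumes 1-based sequence identifiers and takes the database size $n = 6$ from the running example; the function name `periods` is chosen here for illustration.

```python
def periods(sids, n):
    """pes(t) for a pattern that occurs in the q-sequences with the given
    1-based identifiers, in a database of n q-sequences.

    The boundaries alpha_0 = 0 and alpha_{k+1} = n are added, so the list
    has len(sids) + 1 entries.
    """
    bounds = [0] + sorted(sids) + [n]
    return [b - a for a, b in zip(bounds, bounds[1:])]


print(periods([1, 2, 3, 4, 5], 6))  # [1, 1, 1, 1, 1, 1]  for <a b>
print(periods([1, 2, 3, 5, 6], 6))  # [1, 1, 1, 2, 1, 0]  for <(a b)>
```

Both outputs agree with the example above.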
In the PHUSP mining algorithms PHUSPM and PUSOM proposed by Dinh et al. [15,16], three periodicity measures are used to assess the periodicity of HUSPs in sequence databases.
Definition 14.
The maximum periodicity, minimum periodicity, and average periodicity of a sequence $t$ are denoted and defined respectively as
$$maxper(t) = \max(pes(t)),$$
$$minper(t) = \min(pes(t)),$$
$$avgper(t) = \sum_{x \in pes(t)} x \,/\, |pes(t)|.$$
For example, the periods of $\langle (a\, b) \rangle$ are $pes(\langle (a\, b) \rangle) = \{1, 1, 1, 2, 1, 0\}$. Thus, $maxper(\langle (a\, b) \rangle) = 2$, $minper(\langle (a\, b) \rangle) = 0$, and $avgper(\langle (a\, b) \rangle) = 6 / 6 = 1$.
Definition 15.
Let there be five positive user-specified thresholds: $minutil$ (or $\xi$), $minAvg$, $maxAvg$, $minPer$, and $maxPer$. A sequence $t$ is a periodic high-utility sequential pattern if $t$ is a HUSP (it satisfies Definition 10) and $minAvg \le avgper(t) \le maxAvg$, $minper(t) \ge minPer$ and $maxper(t) \le maxPer$.
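The periodicity test of Definitions 14 and 15 can be sketched as below. The sketch covers only the periodicity conditions (the utility condition of Definition 10 is left out), and the threshold values in the usage lines are hypothetical values chosen for illustration, not settings from the paper.

```python
def periodicity_measures(pes):
    """Return (maxper, minper, avgper) for a non-empty list of periods."""
    return max(pes), min(pes), sum(pes) / len(pes)


def is_periodic(pes, min_per, max_per, min_avg, max_avg):
    """Periodicity conditions of Definition 15 (utility check omitted)."""
    maxper, minper, avgper = periodicity_measures(pes)
    return (min_avg <= avgper <= max_avg
            and minper >= min_per
            and maxper <= max_per)


# Periods of <(a b)> from the running example.
print(periodicity_measures([1, 1, 1, 2, 1, 0]))  # (2, 0, 1.0)

# Hypothetical thresholds: every period must be exactly 1.
print(is_periodic([1, 1, 1, 1, 1, 1], 1, 1, 1, 1))  # True
print(is_periodic([1, 1, 1, 2, 1, 0], 1, 1, 1, 1))  # False
```

With these strict thresholds, $\langle a\, b \rangle$ passes while $\langle (a\, b) \rangle$ is rejected because a single period equals 2, which is the behavior that motivates the stability measure introduced next.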
Fournier-Viger et al. introduced a novel model based on the cumulative sum in order to find frequent patterns having a stable periodic behavior in PFP mining [21,22]. The main function of the model is to determine whether all periods of a pattern are stable. Experiments on the SPP [21] and TSPIN [22] algorithms show that this model is flexible and practical. However, this model is not designed for mining patterns in sequential databases, so we have modified it to accommodate them. The model evaluates the stability of a pattern by calculating the cumulative sum of the differences between each period of the pattern and maxPer. We define a measure called lability to assess the periodic behavior of patterns.
Definition 16.
The lability of a sequence $t$ is a list of values denoted as
$$la(t) = \langle la(t, 0), la(t, 1), \ldots, la(t, |S(t)|) \rangle.$$
The $la(t)$ list contains $|S(t)| + 1$ values. In other words, $|la(t)| = |S(t)| + 1 = |pes(t)|$. Each lability value in $la(t)$ is no less than zero. The first lability value of $t$ is defined as $la(t, 0) = \max(0, pes(t, 0) - maxPer)$. Then, the $i$-th lability value of $t$ for $i > 0$ is defined based on the previous lability value as $la(t, i) = \max(0, la(t, i-1) + pes(t, i) - maxPer)$. Thus, lability values are calculated as a cumulative sum. Note that the above definition of lability can also be rewritten more concisely as follows:
$$la(t, i) = \max(0, la(t, i-1) + pes(t, i) - maxPer), \quad la(t, -1) = 0.$$
For example, for the database of the running example and $maxPer = 1$, the periods of $t = \langle (a\, b) \rangle$ are $pes(\langle (a\, b) \rangle) = \{1, 1, 1, 2, 1, 0\}$. Because $t$ has six periods, it also has six lability values. The first lability value of $t$ is $la(t, 0) = \max(0, pes(t, 0) - maxPer) = \max(0, 1 - 1) = 0$. Then, the following lability values are $la(t, 1) = 0$, $la(t, 2) = 0$, $la(t, 3) = 1$, $la(t, 4) = 1$, and $la(t, 5) = 0$. Thus, the lability of the sequence $\langle (a\, b) \rangle$ is $la(\langle (a\, b) \rangle) = \{0, 0, 0, 1, 1, 0\}$.
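The recurrence of Definition 16 is a cumulative sum that is clipped at zero, which can be sketched in a few lines. The function name `lability` and the list representation of $pes(t)$ are choices of this sketch.

```python
def lability(pes, max_per):
    """la(t): cumulative excess of the periods over maxPer, never below 0.

    Implements la(t, i) = max(0, la(t, i-1) + pes(t, i) - maxPer)
    with la(t, -1) = 0.
    """
    values, previous = [], 0
    for period in pes:
        previous = max(0, previous + period - max_per)
        values.append(previous)
    return values


# Periods of <(a b)> from the running example, with maxPer = 1.
print(lability([1, 1, 1, 2, 1, 0], 1))  # [0, 0, 0, 1, 1, 0]
```

The output matches the lability list derived step by step in the example above.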
For a sequence $t$, the values in $la(t)$ correspond one-to-one to the values in $pes(t)$. Smaller periods produce smaller lability values, and larger periods produce larger ones. In addition, because lability is cumulative, a large value in $la(t)$ tends to make the values that follow it large as well. If the values of $la(t)$ are 0 or close to 0, the sequence $t$ has good periodic stability. Otherwise, the sequence $t$ has unstable periodic behavior.
Definition 17.
The maximum lability of a sequence $t$ is defined as
$$maxla(t) = \max(la(t)).$$
The maximum lability of a sequence $t$ is also called the stability of $t$. For example, as the lability values of the sequence $t = \langle (a\, b) \rangle$ are $la(\langle (a\, b) \rangle) = \{0, 0, 0, 1, 1, 0\}$, $maxla(\langle (a\, b) \rangle) = 1$.
Definition 18.
Let there be a sequence database $D$, a sequence $t$, and three user-defined thresholds: a minimum utility threshold ($minutil$ or $\xi$) $> 0$, a maximum period threshold ($maxPer$) $> 0$, and a maximum lability threshold ($maxLa$) $\ge 0$. The problem of mining the stable periodic high-utility sequential patterns in $D$ consists of enumerating each sequence $t$ in $D$ such that $maxla(t) \le maxLa$ and $u(t) \ge \xi$.
To illustrate the above definitions, we give an example of a pattern that other algorithms ignore. Assume that $maxPer = 2$ and $maxLa = 2$ are the limiting conditions. The sequence $t = \langle (c\, g) \rangle$ appears in sequences $s_1$, $s_2$ and $s_3$ of the quantitative sequence database shown in Table 2. Consequently, $S(\langle (c\, g) \rangle) = \{s_1, s_2, s_3\}$. By Definition 13, $pes(\langle (c\, g) \rangle) = \{1, 1, 1, 3\}$, so $maxper(\langle (c\, g) \rangle) = 3$. In the traditional PHUSPM algorithm, the sequence $t = \langle (c\, g) \rangle$ is not considered periodic under the constraint $maxPer = 2$. However, this pattern clearly appears periodically in the first half of the database and is therefore useful for some applications. Because of the strict constraint, it is ignored by the traditional algorithm. In this article, we use the stability strategy to solve this problem. By Definition 16, $la(\langle (c\, g) \rangle) = \{0, 0, 0, 1\}$. Because $maxla(\langle (c\, g) \rangle) = 1 < maxLa = 2$, $t = \langle (c\, g) \rangle$ is a useful pattern. This example shows that the method can find interesting patterns that traditional methods miss.
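The comparison made in this example can be replayed in a short, self-contained sketch that contrasts the traditional maxPer test with the stability test. The database size of 6 is taken from the running example; the function names are illustrative, and only the periodicity side of the problem is shown (utility is not checked).

```python
def periods(sids, n):
    """pes(t) from the 1-based ids of the q-sequences containing t."""
    bounds = [0] + sorted(sids) + [n]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def max_lability(pes, max_per):
    """maxla(t): largest value of the clipped cumulative sum la(t)."""
    worst = current = 0
    for period in pes:
        current = max(0, current + period - max_per)
        worst = max(worst, current)
    return worst


MAX_PER, MAX_LA = 2, 2           # thresholds used in the example
pes = periods([1, 2, 3], n=6)    # <(c g)> occurs in s_1, s_2 and s_3

print(pes)                                   # [1, 1, 1, 3]
print(max(pes) <= MAX_PER)                   # False: rejected by the maxPer test
print(max_lability(pes, MAX_PER) <= MAX_LA)  # True: accepted as stable
```

The pattern fails the traditional test because of its last period, but its maximum lability of 1 stays within maxLa, so the stability test keeps it.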

4. Proposed Algorithms

4.1. The Data Structure

In 2019, Gan et al. [11] proposed the HUSP-ULL algorithm, which utilized a utility-linked (UL)-list structure and a lexicographic sequence (LS)-tree for mining HUSPs. This paper also uses LS-tree and designs a new data structure based on the UL-list, namely period utility-linked (PUL)-list structure. This structure can quickly access period and utility information, which greatly improves the operating efficiency of the algorithm.

4.1.1. Lexicographic Sequence Tree and Concatenations

Each node in the lexicographic sequence tree [39] represents a potential SPHUSP candidate. To decide whether a candidate is an SPHUSP, the utility value in the node is compared with the minimum utility threshold, and the stability value is compared with the maximum lability threshold.
In the LS-tree nodes, all of the original database’s sequences are converted to UL-lists. The designed algorithm utilizes two common sequence mining operations named I-concatenation and S-concatenation to create new sequences (child nodes) in the LS-tree.
Definition 19.
Given a sequence $t$ and an item $i_j$, the I-concatenation of $t$ with $i_j$ consists of appending $i_j$ to the last itemset of $t$, denoted as $\langle t \oplus i_j \rangle_{I\text{-}concatenation}$. The S-concatenation of $t$ with an item $i_j$ consists of adding $i_j$ to a new itemset appended after the last itemset of $t$, denoted as $\langle t \oplus i_j \rangle_{S\text{-}concatenation}$.
For example, given a sequence $t = \langle [b], [c] \rangle$ and an item $a$, $\langle t \oplus a \rangle_{I\text{-}concatenation} = \langle [b], [a\, c] \rangle$ and $\langle t \oplus a \rangle_{S\text{-}concatenation} = \langle [b], [c], [a] \rangle$.
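The two concatenation operations of Definition 19 can be sketched as pure functions on a list-of-lists representation of a sequence. The representation and the function names are assumptions of this sketch; itemsets are kept in lexicographic order, as in the example.

```python
def i_concatenation(t, item):
    """Append item to the last itemset of the non-empty sequence t."""
    return t[:-1] + [sorted(t[-1] + [item])]


def s_concatenation(t, item):
    """Append a new itemset containing only item after the last itemset."""
    return t + [[item]]


t = [["b"], ["c"]]
print(i_concatenation(t, "a"))  # [['b'], ['a', 'c']]
print(s_concatenation(t, "a"))  # [['b'], ['c'], ['a']]
print(t)                        # [['b'], ['c']]  (t itself is unchanged)
```

Both functions return new lists, so a parent node can be extended into several child nodes without copying it first.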
After an I-concatenation, the number of itemsets in the sequence stays the same, whereas after an S-concatenation it rises by one. All potential sequential patterns in the search space of SPHUSPM can be constructed with these two operations.
The algorithm’s search procedure can be viewed as gradually constructing an LS-tree. The method first searches the database for the set of 1-sequences that meet the minimum utility and maximum lability thresholds. The LS-tree is then explored by using a depth-first search, starting from the 1-sequences. We use the I-concatenation and S-concatenation operations to obtain the child nodes of each node. Finally, the complete set of SPHUSPs is obtained.

4.1.2. The Period-Utility-Linked List Structure

Repeatedly scanning the original database during mining is costly. To avoid this, this study uses the same UL-list structure as the HUSP-ULL algorithm [11] to store the relevant information of each q-sequence in the q-sequence database, and then constructs the PUL-list to discover SPHUSPs. The PUL-list contains the SID of each q-sequence in which the candidate sequence occurs and the suffix sequence information in that q-sequence. To facilitate candidate sequence expansion and utility calculation, the PUL-list also stores the first occurrence position of each distinct item of the suffix sequence in the q-sequence, that is, index information. In addition, to make the stability of a candidate pattern easy to measure, the PUL-list records the order in the original database of each q-sequence that contains the candidate sequence (the period information).
The UL-list contains the utility and position (UP) information and the header table of the q-sequence. The UP information records the item name, the utility of the item, the remaining utility of the item, and the position where the item next appears in the sequence. The header table stores the name of each item in the q-sequence and the position where the item first appears in the sequence.
For example, Table 5 shows the UL-list of $s_1$. In the UP information, the first item in $s_1$ is $a$, its utility is 1, its remaining utility is 41, and its next position is empty, which means that $a$ does not appear again in this sequence. The header table stores the first occurrence position of each distinct item in $s_1$. The entry $(a, 1)$ indicates that item $a$ first appears at the first position of $s_1$.
The PUL-list is a projection database of candidate sequences in the mining process. It contains the UP information, the header table, and the periodic information of a candidate sequence across multiple q-sequences. Its purpose is to facilitate the calculation of periodicity and stability. Table 6 shows the PUL-list of $\langle (a\,b) \rangle$ as an example. The PUL-list of $\langle (a\,b) \rangle$ stores the projection information of the sequence $\langle (a\,b) \rangle$ in each q-sequence; for example, the projection of $\langle (a\,b) \rangle$ in $s_2$ is $\langle [(c, 3)(f, 2)], [(a, 5)(c, 2)(g, 5)], [(b, 3)(d, 2)(e, 2)] \rangle$. The periodic information in this table records the indices of the sequences that contain $\langle (a\,b) \rangle$ in the original database (i.e., the quantitative sequence database), and $\langle 1, 2, 3, 5, 6 \rangle$ means that $\langle (a\,b) \rangle$ appears in the first, second, third, fifth, and sixth sequences.
The SPHUSPM algorithm uses the PUL-list structure to access utility, position, and period information directly instead of repeatedly scanning the original database. This structure significantly improves both the access efficiency and the mining efficiency of the algorithm.
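To make the layout of these structures more concrete, here is a minimal Python sketch. All class, field, and function names (`UPEntry`, `ULList`, `PULList`, `build_ul_list`) are hypothetical, the q-sequence is flattened into a single list of (item, utility) positions without itemset boundaries, and the sample data are invented for illustration; only the period information `<1, 2, 3, 5, 6>` is taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class UPEntry:
    """One row of the utility-and-position (UP) information."""
    item: str
    utility: int
    remaining_utility: int
    next_position: Optional[int]  # None when the item does not occur again


@dataclass
class ULList:
    """UL-list of one q-sequence: UP information plus header table."""
    sid: int
    up_info: List[UPEntry]
    header: Dict[str, int]  # item -> 1-based position of its first occurrence


@dataclass
class PULList:
    """Projected UL-lists of a candidate pattern plus its period information."""
    pattern: str
    projections: List[ULList] = field(default_factory=list)

    @property
    def period_info(self) -> List[int]:
        # Order in the original database of each q-sequence containing the pattern.
        return [ul.sid for ul in self.projections]


def build_ul_list(sid: int, items: List[Tuple[str, int]]) -> ULList:
    """Build a UL-list from a flattened q-sequence of (item, utility) pairs."""
    total = sum(u for _, u in items)
    header: Dict[str, int] = {}
    up_info: List[UPEntry] = []
    seen = 0
    for pos, (item, u) in enumerate(items, start=1):
        header.setdefault(item, pos)
        seen += u
        nxt = next((p for p in range(pos + 1, len(items) + 1)
                    if items[p - 1][0] == item), None)
        up_info.append(UPEntry(item, u, total - seen, nxt))
    return ULList(sid, up_info, header)


# Invented q-sequence: a(1), b(4), a(2).
ul = build_ul_list(1, [("a", 1), ("b", 4), ("a", 2)])
print(ul.header)      # {'a': 1, 'b': 2}
print(ul.up_info[0])  # remaining utility 6, next position 3

# Period information of <(ab)> as quoted in the text (projections left empty here).
pul = PULList("<(ab)>", [ULList(sid, [], {}) for sid in (1, 2, 3, 5, 6)])
print(pul.period_info)  # [1, 2, 3, 5, 6]
```

Keeping the period information next to the projections is what lets the stability check run without another pass over the database.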

4.2. Pruning Strategy

Numerous candidates are produced during the mining process, which can lead to a combinatorial explosion and slow the algorithm down. We therefore introduce several pruning strategies to address this problem.

4.2.1. The Downward Closure Property of Upper Bound

Definition 20.
The sequence-weighted utilization (SWU) [3] of a sequence $t$ in a quantitative sequential database $D$ is denoted as $SWU(t)$ and defined as
$$SWU(t) = \sum_{s' \sim t \,\wedge\, s' \subseteq s \,\wedge\, s \in D} u(s).$$
For example, $SWU(\langle a \rangle) = u(s_1) + u(s_2) + u(s_3) + u(s_4) + u(s_5) + u(s_6) = 42 + 68 + 56 + 47 + 52 + 49 = 314$ and $SWU(\langle f \rangle) = u(s_2) + u(s_3) + u(s_6) = 68 + 56 + 49 = 173$.
Theorem 1.
Given a quantitative sequential database $D$ and two sequences $t$ and $t'$, if $t \subseteq t'$, then
$$SWU(t') \leq SWU(t).$$
Proof. 
Because $t \subseteq t'$, $SWU(t') = \sum_{s' \sim t' \,\wedge\, s' \subseteq s \,\wedge\, s \in D} u(s) \leq \sum_{s' \sim t \,\wedge\, s' \subseteq s \,\wedge\, s \in D} u(s) = SWU(t)$.    □
Theorem 2.
Given a quantitative sequential database $D$ and a sequence $t$, it can be obtained that
$$u(t) \leq SWU(t).$$
Proof. 
Because $u(t, s) \leq u(s)$, we can obtain that $u(t) = \sum_{s \in D} u(t, s) \leq \sum_{s' \sim t \,\wedge\, s' \subseteq s \,\wedge\, s \in D} u(s) = SWU(t)$.    □
From the SWU definition and the theorems above, if the SWU value of a sequence $t$ is less than the minimum utility threshold, then the utility of $t$ must also be less than that threshold, and so must the utility of any supersequence of $t$. SWU can therefore eliminate many unqualified candidates. In practice, however, the utility of a sequence $t$ is typically much smaller than its SWU value, so a tighter upper bound is also needed.
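The SWU example above can be reproduced with a few lines of Python. The sequence utilities and the occurrence sets are the ones quoted in the example; the names `sequence_utility`, `occurrences`, `swu`, and `prune_by_swu`, as well as the threshold of 200, are assumptions introduced only for this sketch.

```python
from typing import Dict, Set

# Sequence utilities u(s_1)..u(s_6) as quoted in the SWU example.
sequence_utility: Dict[int, int] = {1: 42, 2: 68, 3: 56, 4: 47, 5: 52, 6: 49}

# SIDs of the q-sequences containing each 1-sequence, read off the same example.
occurrences: Dict[str, Set[int]] = {"a": {1, 2, 3, 4, 5, 6}, "f": {2, 3, 6}}


def swu(pattern: str) -> int:
    """Sum the utilities of all q-sequences that contain the pattern."""
    return sum(sequence_utility[sid] for sid in occurrences[pattern])


def prune_by_swu(min_util: int) -> Set[str]:
    """Keep only the patterns whose SWU reaches the minimum utility threshold."""
    return {p for p in occurrences if swu(p) >= min_util}


print(swu("a"))                   # 314
print(swu("f"))                   # 173
print(sorted(prune_by_swu(200)))  # ['a'] (200 is a hypothetical threshold)
```

With the hypothetical threshold of 200, `<f>` and, by Theorem 1, all of its supersequences can be discarded without computing their exact utility.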
We introduce two strategies from the HUSP-ULL algorithm [11] to generate a stricter upper bound, which is built on the prefix extension utility (PEU) model.
Definition 21.
The PEU of a sequence $t$ in a q-sequence $s$ is denoted as $PEU(t, s)$ and defined as
$$PEU(t, s) = \max\{u(s_k) + u(\langle s - s_k \rangle)_{rest} \mid t \sim s_k \,\wedge\, s_k \subseteq s\}.$$
Definition 22.
The PEU of a sequence $t$ in $D$ is denoted as $PEU(t)$ and defined as
$$PEU(t) = \sum_{s \in D} \{PEU(t, s) \mid t \subseteq s\}.$$
Theorem 3.
Given a quantitative sequential database $D$ and two sequences $t$ and $t'$, if $t \subseteq t'$, we obtain
$$PEU(t') \leq PEU(t).$$
Theorem 4.
Given a quantitative sequential database $D$ and a sequence $t$, we can obtain
$$u(t) \leq PEU(t).$$
The proofs of Theorems 3 and 4 can be found in [11]. Together they show that if the PEU value of a sequence $t$ is below the minimum utility threshold, then the PEU value of every supersequence of $t$ is also below that threshold, and consequently the utilities of $t$ and of all its supersequences are below the threshold as well.
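The PEU definitions can be sketched in Python as follows. This is an illustrative brute-force version, not the list-based computation of HUSP-ULL [11]: it enumerates every match of the pattern in a q-sequence and takes the largest value of match utility plus the utility of the remaining items. The representation (a q-sequence as a list of item-to-utility dictionaries), the function names, and the sample q-sequence are all assumptions made for the example.

```python
from typing import Dict, List

QSequence = List[Dict[str, int]]  # each itemset maps item -> utility
Pattern = List[List[str]]         # each itemset is a sorted list of items


def rest_utility(s: QSequence, idx: int, last_item: str) -> int:
    """Utility of everything after `last_item` in itemset `idx` of `s`."""
    same_itemset = sum(u for item, u in s[idx].items() if item > last_item)
    later_itemsets = sum(sum(itemset.values()) for itemset in s[idx + 1:])
    return same_itemset + later_itemsets


def peu_in_sequence(t: Pattern, s: QSequence) -> int:
    """Max over all matches of t in s of (match utility + rest utility); 0 if no match."""
    best = 0

    def extend(k: int, start: int, acc: int) -> None:
        nonlocal best
        for idx in range(start, len(s)):
            if all(item in s[idx] for item in t[k]):
                u = acc + sum(s[idx][item] for item in t[k])
                if k == len(t) - 1:
                    best = max(best, u + rest_utility(s, idx, t[k][-1]))
                else:
                    extend(k + 1, idx + 1, u)

    extend(0, 0, 0)
    return best


def peu(t: Pattern, db: List[QSequence]) -> int:
    """Sum PEU(t, s) over the q-sequences of db (sequences without a match add 0)."""
    return sum(peu_in_sequence(t, s) for s in db)


# Invented q-sequence with total utility 15.
s = [{"a": 2, "b": 3}, {"a": 4, "c": 1}, {"b": 5}]
print(peu_in_sequence([["a"]], s))         # 15
print(peu_in_sequence([["a"], ["b"]], s))  # 9
```

In this toy case the bound shrinks from 15 to 9 when `<a>` is extended to `<[a], [b]>`, in line with Theorem 3.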

4.2.2. Pruning Strategies

A candidate sequence $t$ can produce many new candidate sequences through I-concatenations and S-concatenations. To reduce the number of candidate sequences, this research adopts the look-ahead strategy (LAS) and the irrelevant item pruning strategy (IPS) of the HUSP-ULL algorithm [11] to eliminate hopeless candidates in advance.
Theorem 5.
Given a sequence $t$ and a quantitative sequential database $D$, two situations are considered when generating a supersequence.
(1) If $i_j$ is an I-concatenation candidate item of $t$, the maximal utility of $\langle t \oplus i_j \rangle_{I\text{-}concatenation}$ is no more than $\sum_{s \in D} \{PEU(t, s) \mid \langle t \oplus i_j \rangle_{I\text{-}concatenation} \subseteq s\}$.
(2) If $i_j$ is an S-concatenation candidate item of $t$, the maximal utility of $\langle t \oplus i_j \rangle_{S\text{-}concatenation}$ is no more than $\sum_{s \in D} \{PEU(t, s) \mid \langle t \oplus i_j \rangle_{S\text{-}concatenation} \subseteq s\}$.
The proof of Theorem 5 can be found in [11].
Look-Ahead Strategy (LAS):
(1) If $i_j$ is an I-concatenation candidate item of $t$ and $\sum_{s \in D} \{PEU(t, s) \mid \langle t \oplus i_j \rangle_{I\text{-}concatenation} \subseteq s\}$ is less than the minimum utility threshold, $i_j$ should be removed from $C_I$ (the set of candidate items for I-concatenation with $t$).
(2) If $i_j$ is an S-concatenation candidate item of $t$ and $\sum_{s \in D} \{PEU(t, s) \mid \langle t \oplus i_j \rangle_{S\text{-}concatenation} \subseteq s\}$ is less than the minimum utility threshold, $i_j$ should be removed from $C_S$ (the set of candidate items for S-concatenation with $t$).
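The look-ahead check can be sketched as a simple filter. The sketch assumes that $PEU(t, s)$ has already been computed for every q-sequence containing $t$ and that, for each candidate item, the SIDs of the q-sequences containing the extended sequence are known. The function name `look_ahead`, its parameters, and all numbers are hypothetical.

```python
from typing import Dict, Set


def look_ahead(peu_by_sid: Dict[int, int],
               extension_sids: Dict[str, Set[int]],
               min_util: int) -> Set[str]:
    """Return the candidate items that survive the look-ahead check.

    peu_by_sid:     PEU(t, s) for every q-sequence s (keyed by SID) containing t.
    extension_sids: for each candidate item, the SIDs of the q-sequences that
                    contain t extended with that item.
    """
    kept = set()
    for item, sids in extension_sids.items():
        bound = sum(peu_by_sid[sid] for sid in sids)
        if bound >= min_util:
            kept.add(item)
    return kept


# Invented values: t occurs in sequences 1, 2, and 3.
peu_by_sid = {1: 30, 2: 25, 3: 10}
candidates = {"c": {1, 2}, "d": {3}}
print(look_ahead(peu_by_sid, candidates, min_util=40))  # {'c'}
```

The same filter would be applied once to the I-concatenation candidates ($C_I$) and once to the S-concatenation candidates ($C_S$).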
Theorem 6.
For any sequence $t$ and item $i_j \in I(t)_{rest}$, the maximal utility of $\langle t \oplus i_j \rangle_{I\text{-}concatenation}$ or $\langle t \oplus i_j \rangle_{S\text{-}concatenation}$ is no more than $\sum_{s \in D} \{PEU(t, s) \mid (\langle t \oplus i_j \rangle_{I\text{-}concatenation} \subseteq s) \vee (\langle t \oplus i_j \rangle_{S\text{-}concatenation} \subseteq s)\}$.
The proof of Theorem 6 can be found in [11].
Irrelevant Item Pruning Strategy (IPS):
Given a sequence $t$ and an item $i_j \in I(t)_{rest}$, if $\sum_{s \in D} \{PEU(t, s) \mid (\langle t \oplus i_j \rangle_{I\text{-}concatenation} \subseteq s) \vee (\langle t \oplus i_j \rangle_{S\text{-}concatenation} \subseteq s)\}$ is less than the minimum utility threshold, $i_j$ is called an irrelevant item of $t$ and should be removed from the utility-linked lists of $t$ and of $t$'s supersequences.
The algorithm uses the LAS and IPS pruning strategies to remove a large number of candidates, which greatly improves its running efficiency. The places where these two strategies are applied are marked in the algorithm section.
Because the pruning strategies above all prune in terms of utility, we further adopt the maximum lability pruning (MLP) strategy of the TSPIN algorithm [22] to speed up mining. However, that strategy is only applicable to transactional databases, so we modified it to make it suitable for quantitative sequence databases.
Theorem 7.
For any two sequences $t \subseteq t'$ in $D$, the relationship $maxla(t') \geq maxla(t)$ holds.
Proof. 
Because $t \subseteq t'$, it follows that $S(t') \subseteq S(t)$. In the case where $S(t') = S(t)$, the periods of $t$ and $t'$ are the same. Hence, $la(t') = la(t)$ and $maxla(t') = maxla(t)$. In the case where $S(t') \subset S(t)$, for each sequence $\{s_z \mid s_z \in S(t) \wedge s_z \notin S(t')\}$, the corresponding periods in $pes(t)$ are smaller than those in $pes(t')$. Thus, $maxla(t') \geq maxla(t)$.    □
Theorem 8.
For a sequence database $D$, if $maxla(t) > maxLa$ for a sequence $t$, then neither $t$ nor any of its supersequences is stable.
Hence, the part of the search space containing $t$ and its supersequences can be ignored.
Proof. 
According to the definition of an SPHUSP, if $maxla(t) > maxLa$, then $t$ is not an SPHUSP. By Theorem 7, any supersequence $t'$ of $t$ is then also not an SPHUSP.    □
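The pruning condition of Theorem 8 can be illustrated with a short Python sketch. The formal definitions of periods and lability are given earlier in the paper, outside this section; the sketch assumes the recurrence used for stable periodic-frequent patterns [21], namely that the lability starts at 0 and is updated as $\max(0, la + period - maxPer)$ for each successive period, with the leading and trailing gaps counted as periods. The function names are hypothetical. The occurrence list `<1, 2, 3, 5, 6>` in a database of six sequences is the period information quoted for $\langle (a\,b) \rangle$; the subset `[1, 5]` is invented.

```python
from typing import List


def periods(sids: List[int], db_size: int) -> List[int]:
    """Gaps between consecutive occurrences, including the leading and trailing gap."""
    points = [0] + sorted(sids) + [db_size]
    return [b - a for a, b in zip(points, points[1:])]


def max_lability(sids: List[int], db_size: int, max_per: int) -> int:
    """Largest cumulative excess of the periods over max_per (assumed definition)."""
    la, worst = 0, 0
    for p in periods(sids, db_size):
        la = max(0, la + p - max_per)
        worst = max(worst, la)
    return worst


def is_stable(sids: List[int], db_size: int, max_per: int, max_la: int) -> bool:
    """A pattern passes the MLPS check when its maxla does not exceed max_la."""
    return max_lability(sids, db_size, max_per) <= max_la


print(periods([1, 2, 3, 5, 6], 6))          # [1, 1, 1, 2, 1, 0]
print(max_lability([1, 2, 3, 5, 6], 6, 1))  # 1
print(max_lability([1, 5], 6, 1))           # 3
```

A pattern that occurs only in a subset of these sequences (here `[1, 5]`) has a larger maximum lability, which is the monotonicity that Theorem 7 relies on.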

4.3. The SPHUSPM Algorithm

Based on the proposed problem definition, the PUL-list structure and SWU, LAS, IPS, and MLPS strategies, this section proposes the stable periodic high-utility sequential pattern mining (SPHUSPM) algorithm. The framework of the proposed SPHUSPM algorithm is shown in Figure 1, which shows the main parts of the method and their interworking. This algorithm is the first in the field of HUSPM to mine sequential patterns with stable periods. This work is challenging because SPHUSPs have not been defined before, and stability methods cannot be directly applied to quantitative sequence databases.
The pseudo-code of SPHUSPM is given in Algorithm 1.
First, the algorithm scans the quantitative sequence database $D$ to calculate $u(s)$ and $u(D)$, and constructs the PUL-list for each q-sequence $s \in D$ (Line 1). It then initializes the set of SPHUSPs (Line 2). For each item $i_j \in D$, the algorithm builds a projection database $PD(i_j)$ to store the transformed PUL-list (Lines 3 to 4). The utility and $SWU$ values of each 1-sequence are computed by using the corresponding projection database (Line 5). A 1-sequence whose $SWU$ value is not less than the minimum utility threshold is considered a candidate SPHUSP (Lines 5 to 12); 1-sequences with lower $SWU$ values are considered unpromising and are deleted at this step. The PUL-list of $(i_j)$ is then used to calculate the set of $pes$ and the set of $la$, and the stability of $(i_j)$ is judged. If the stability condition is not satisfied, neither the 1-sequence nor its supersequences can be SPHUSPs; this is where the MLPS pruning strategy is applied. A 1-sequence that passes this check and whose utility is not less than the minimum utility threshold is output as an SPHUSP (Lines 6 to 10). Next, the PGrowth procedure treats candidates as prefixes for mining more SPHUSPs (Line 12). This procedure is adopted from [11]. In this part, we acquire the projection database $PD(prefix)$ of the $prefix$, and obtain $C_I$ (the candidate item set for I-concatenation) and $C_S$ (the candidate item set for S-concatenation) through the IPS and LAS strategies, respectively. Then, we use the items in $C_I$ and $C_S$ to perform I-concatenation or S-concatenation with the $prefix$, and enter the Judge-SPHUSPs procedure (Lines 13 to 18).
Algorithm 1: SPHUSPM
Input: $D$, a quantitative sequential database; $utable$, a utility table containing the unit profit of each item; $minutil$, the minimum utility threshold; $maxPer$, the maximum periodicity threshold; $maxLa$, the maximum lability threshold.
Output: The complete set of SPHUSPs.
The Judge-SPHUSPs procedure (Algorithm 2) first constructs the PUL-list of $prefix$ from the projection database of $prefix$, and obtains the projection database $PD(prefix)$ (Line 1). It then calculates the PEU value, the utility value, the set of $pes$, and the set of $la$ of $prefix$ from $PD(prefix)$ (Line 2). If the utility of $prefix$ is not less than the minimum utility threshold and the $maxla$ value of $prefix$ is not greater than $maxLa$, $prefix$ is determined to be an SPHUSP (Lines 4 to 6). If the PEU value of $prefix$ is not less than the minimum utility threshold and the $maxla$ value of $prefix$ is not greater than $maxLa$, $prefix$ enters the PGrowth procedure (Line 8). If no candidate sequence is generated, the procedure ends. Finally, the algorithm returns the list of mined SPHUSPs.
Algorithm 2: Judge-SPHUSPs
Input: $prefix$, $PD(prefix)$, $SPHUSPs$
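The two decisions described for the Judge-SPHUSPs procedure can be summarized in a small Python sketch. It is not a reconstruction of Algorithm 2: the utility, PEU, and $maxla$ values are assumed to be already computed from $PD(prefix)$, and the `Candidate` class, the `expand` callback standing in for PGrowth, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    name: str
    utility: int  # u(prefix)
    peu: int      # PEU(prefix)
    max_la: int   # maxla(prefix)


def judge_sphusps(c: Candidate, min_util: int, max_la_threshold: int,
                  results: List[str],
                  expand: Callable[[Candidate], None]) -> None:
    """Output the candidate if it is a stable high-utility pattern and
    expand it further only if its PEU bound and stability allow it."""
    stable = c.max_la <= max_la_threshold
    if c.utility >= min_util and stable:
        results.append(c.name)  # prefix is an SPHUSP
    if c.peu >= min_util and stable:
        expand(c)               # hand over to the growth step


results: List[str] = []
expanded: List[str] = []
for cand in (Candidate("<a>", 50, 80, 1),    # output and expanded
             Candidate("<b>", 30, 60, 0),    # only expanded
             Candidate("<c>", 90, 95, 5)):   # unstable: pruned
    judge_sphusps(cand, min_util=40, max_la_threshold=2,
                  results=results, expand=lambda c: expanded.append(c.name))

print(results)   # ['<a>']
print(expanded)  # ['<a>', '<b>']
```

The third candidate shows why the stability test matters for efficiency: although its utility is high, it is never expanded, so its whole subtree is skipped.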

4.4. Total Computational Complexity

To further understand the proposed algorithm, we analyze its time complexity. Let $N_1$ be the number of sequences in the quantitative sequence database, $N_2$ the average number of items per sequence, $N_3$ the number of candidates produced by the algorithm, and $N_4$ the average number of occurrences of a candidate.
The SPHUSPM algorithm scans the original database once when building the projection database and performing width pruning, so the time complexity of this step is $O(N_1 \times N_2)$. The algorithm builds the LS-tree by expanding the candidates, so the time complexity of this process is $O(N_3 \times N_4)$. Every candidate enters the Judge-SPHUSPs procedure to determine whether it is a real SPHUSP, so the time complexity of this operation is $O(N_3)$.
In summary, the time complexity of the SPHUSPM algorithm is $O(N_1 \times N_2 + N_3 \times N_4 + N_3)$.

5. Experimental Evaluation

Because SPHUSPM is the first algorithm for mining SPHUSPs, its performance is not compared with that of other algorithms. SPHUSPM is implemented in Java. To evaluate it, extensive experiments were carried out on a workstation running Windows 10, equipped with an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz and 32 GB of RAM.

5.1. Datasets

Six real datasets [40] are used in the experiment to evaluate the performance of the algorithm.
Sign. The National Center for Sign Language and Gesture Resources at Boston University developed Sign, a real-world dataset of sign language utterance sequences. Every utterance in the dataset corresponds to a video segment that has been meticulously transcribed.
Bible. By converting the Bible into a collection of item sequences, the Bible can be viewed as a real-world dataset.
Kosarak10k. Kosarak10k is a subset of the original Kosarak dataset. This is a real-world dataset made up of click-stream data from a news portal in Hungary.
Leviathan. Leviathan is a conversion of Thomas Hobbes’ Leviathan novel (1651) to a sequence of items (words).
yoochoose-buys. YOOCHOOSE GmbH created the yoochoose-buys commercial dataset to assist RecSys Challenge 2015 participants.
MSNBC. The click-stream data from the MSNBC website has been transformed from the original data from the UCI repository to create the MSNBC dataset. Only 31,790 sequences remain after the smallest ones were eliminated.
Table 7 and Table 8 show the parameters and properties of these datasets, respectively. The above datasets can be downloaded from [40].

5.2. Execution Time

Figure 2 shows the execution time of the SPHUSPM algorithm on the six real datasets mentioned above. We observe the performance of the algorithm on different datasets under different thresholds by varying the values of $minutil$, $maxPer$, and $maxLa$. In Figure 2, the x axis represents $minutil$ and the y axis represents execution time. The label $P$-$L$ indicates $maxPer = P$ and $maxLa = L$. The following observations can be drawn from Figure 2.
Execution time tends to be higher when $maxLa$ is set high. The reason is that as $maxLa$ increases, the stability requirement on sequential patterns is relaxed, so SPHUSPM must consider more sequential patterns during mining.
Similarly, decreasing $minutil$ and increasing $maxLa$ both relax the thresholds. In these cases, SPHUSPM must consider more sequential patterns, which increases the execution time.
SPHUSPM performs better on sparse datasets than on dense ones. Because the same sequential pattern occurs with low probability in a sparse dataset, pattern stability is low, so a large number of unstable patterns are pruned and not considered further.

5.3. Pattern Count

Figure 3 shows the number of patterns found by the SPHUSPM algorithm on the six real datasets mentioned above. We observe the performance of the algorithm on different datasets under different thresholds by varying the values of $minutil$, $maxPer$, and $maxLa$. In this figure, the x axis represents the minimum utility threshold ($minutil$), and the y axis represents the number of SPHUSPs generated. The label $P$-$L$ means $maxPer = P$ and $maxLa = L$. The following observations can be drawn from Figure 3.
When $maxLa$ and $maxPer$ are fixed and $minutil$ is increased, or when $minutil$ and $maxPer$ are fixed and $maxLa$ is decreased, the number of sequential patterns is reduced because the thresholds become stricter.
When $minutil$ remains unchanged, the number of SPHUSPs increases with $maxPer$ and $maxLa$, indicating that the HUSPs in these datasets contain a large number of potentially stable patterns.

5.4. Memory Usage

Table 9 shows the memory usage of the SPHUSPM algorithm under different $minutil$, $maxPer$, and $maxLa$ values on the six datasets. The results show that SPHUSPM uses less memory when $minutil$ is high and $maxPer$ and $maxLa$ are low. This is reasonable because the algorithm can prune more sequential patterns in these cases.

5.5. Effectiveness of Pruning Strategies

To test the performance of the improved maximum lability pruning strategy in sequential pattern mining (MLPS), we conduct experiments on the six datasets. Figure 4 shows the number of candidates generated by SPHUSPM for different values of $minutil$, $maxPer$, and $maxLa$ on the six datasets. In this figure, the x axis represents the minimum utility threshold ($minutil$), and the y axis represents the number of generated candidates. The label $P$-$L$ means $maxPer = P$ and $maxLa = L$. In addition, MLP-$P$-$L$ indicates that the MLPS strategy is used. The following observations can be drawn from Figure 4.
MLPS eliminates a large number of candidates in advance and performs well on all datasets. The reduction in the number of candidates greatly reduces both the search space and the execution time of the algorithm. Moreover, fewer candidates are generated when $maxPer$ and $maxLa$ are low.

6. Conclusions

In this paper, the problem of mining stable periodic high-utility sequential patterns is defined, and the stability method is proposed for the first time in the mining of high-utility sequential patterns. In the HUSPM research area, this method provides a more flexible, precise decision-making mining strategy. In application areas, this method can be widely used in pattern discovery and knowledge discovery-related fields, such as research on consumer habits, website click-through rate data analysis, financial market analysis, biomedical applications, and mobile computing. To efficiently discover all SPHUSPs, an efficient SPHUSPM algorithm is designed. After experimental verification, the SPHUSPM algorithm can not only mine the sequence patterns ignored by the traditional algorithm, but also ensure that the mined sequence patterns show stable periodic characteristics in databases. At the same time, the addition of the PUL-list structure and the MLPS strategy accelerates the access speed and reduces the search space, so that the algorithm can ensure that the required sequence pattern can be mined, while also improving the mining efficiency and memory usage efficiency.
In future work, we will introduce the non-redundant strategy [17] and the negative utility strategy [41] based on this algorithm. These strategies are used to improve the accuracy of decision making and expand application scenarios.

Author Contributions

Conceptualization, S.X.; methodology, S.X.; software, S.X.; validation, L.Z.; formal analysis, L.Z.; investigation, L.Z.; resources, L.Z.; data curation, S.X.; writing—original draft preparation, S.X.; writing—review and editing, L.Z.; visualization, S.X.; supervision, L.Z.; project administration, L.Z.; funding acquisition, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was partly supported by the National Natural Science Foundation of China (61806105, 62076143, and 61906104) and the Natural Science Foundation of the Shandong Province (ZR2019BF018, ZR2021QF059).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PFPM: Periodic Frequency Pattern Mining
PPFPM: Partial Periodic Frequency Pattern Mining
SPFPM: Stable Periodic Frequent Pattern Mining
HUIM: High-Utility Itemset Mining
HUSPM: High-Utility Sequential Pattern Mining
HUSPs: High Utility Sequential Patterns
PHUSPM: Periodic High Utility Sequential Pattern Mining
PHUSPs: Periodic High Utility Sequential Patterns
SPHUSPM: Stable Periodic High Utility Sequential Pattern Mining
SPHUSPs: Stable Periodic High Utility Sequential Patterns
$minutil$: minimum utility threshold
$maxPer$: maximum periodicity threshold
$minPer$: minimum periodicity threshold
$avgPer$: average periodicity threshold
$maxLa$: maximum lability threshold
UL-list: utility-linked-list
PUL-list: period-utility-linked-list
LS-tree: Lexicographic Sequence Tree
SWU: Sequence Weighted Utilization
PEU: Prefix Extension Utility
RSU: Reduced Sequence Utility
SEU: Sequence Extended Utility
MPP: Maximum Periodic Pruning
MLP: Maximum Lability Pruning
MLPS: Maximum Lability Pruning in sequential pattern mining
IPS: Irrelevant Item Pruning Strategy
LAS: Look Ahead Strategy

References

  1. Zhou, L.; Liu, Y.; Wang, J.; Shi, Y. Utility-based web path traversal pattern mining. In Proceedings of the Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), Omaha, NE, USA, 28–31 October 2007; pp. 373–380.
  2. Truong-Chi, T.; Fournier-Viger, P. A survey of high utility sequential pattern mining. In High-Utility Pattern Mining; Springer: Berlin/Heidelberg, Germany, 2019; pp. 97–129.
  3. Yin, J.; Zheng, Z.; Cao, L. USpan: An efficient algorithm for mining high utility sequential patterns. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, 12–16 August 2012; pp. 660–668.
  4. Wang, J.Z.; Huang, J.L.; Chen, Y.C. On efficiently mining high utility sequential patterns. Knowl. Inf. Syst. 2016, 49, 597–627.
  5. Ishita, S.Z.; Ahmed, C.F.; Leung, C.K. New approaches for mining regular high utility sequential patterns. Appl. Intell. 2022, 52, 3781–3806.
  6. Ahmed, C.F.; Tanbeer, S.K.; Jeong, B.S. A Novel Approach for Mining High-Utility Sequential Patterns in Sequence Databases. ETRI J. 2010, 32, 676–686.
  7. Yin, J.; Zheng, Z.; Cao, L.; Song, Y.; Wei, W. Efficiently mining top-k high utility sequential patterns. In Proceedings of the 2013 IEEE 13th international Conference on Data Mining, Dallas, TX, USA, 7–10 December 2013; pp. 1259–1264.
  8. Lan, G.C.; Hong, T.P.; Tseng, V.S.; Wang, S.L. Applying the maximum utility measure in high utility sequential pattern mining. Expert Syst. Appl. 2014, 41, 5071–5081.
  9. Alkan, O.K.; Karagoz, P. CRoM and HuspExt: Improving efficiency of high utility sequential pattern extraction. IEEE Trans. Knowl. Data Eng. 2015, 27, 2645–2657.
  10. Gan, W.; Lin, J.C.W.; Zhang, J.; Chao, H.C.; Fujita, H.; Philip, S.Y. ProUM: High utility sequential pattern mining. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 767–773.
  11. Gan, W.; Lin, J.C.W.; Zhang, J.; Fournier-Viger, P.; Chao, H.C.; Philip, S.Y. Fast utility mining on sequence data. IEEE Trans. Cybern. 2020, 51, 487–500.
  12. Ahmed, C.F.; Tanbeer, S.K.; Jeong, B.S. Mining high utility web access sequences in dynamic web log data. In Proceedings of the 2010 11th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, London, UK, 9–11 June 2010; pp. 76–81.
  13. Shie, B.E.; Yu, P.S.; Tseng, V.S. Mining interesting user behavior patterns in mobile commerce environments. Appl. Intell. 2013, 38, 418–435.
  14. Zihayat, M.; Davoudi, H.; An, A. Top-k utility-based gene regulation sequential pattern discovery. In Proceedings of the 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Shenzhen, China, 15–18 December 2016; pp. 266–273.
  15. Dinh, T.; Huynh, V.N.; Le, B. Mining periodic high utility sequential patterns. In Proceedings of the Asian Conference on Intelligent Information and Database Systems, Kanazawa, Japan, 3–5 April 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 545–555.
  16. Dinh, D.T.; Le, B.; Fournier-Viger, P.; Huynh, V.N. An efficient algorithm for mining periodic high-utility sequential patterns. Appl. Intell. 2018, 48, 4694–4714.
  17. Afriyie, M.K.; Nofong, V.M.; Wondoh, J.; Abdel-Fatao, H. Mining non-redundant periodic frequent patterns. In Proceedings of the Asian Conference on Intelligent Information and Database Systems, Phuket, Thailand, 23–26 March 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 321–331.
  18. Amphawan, K.; Surarerks, A.; Lenca, P. Mining periodic-frequent itemsets with approximate periodicity using interval transaction-ids list tree. In Proceedings of the 2010 Third International Conference on Knowledge Discovery and Data Mining, Phuket, Thailand, 9–10 January 2010; pp. 245–248.
  19. Fournier-Viger, P.; Lin, C.W.; Duong, Q.H.; Dam, T.L.; Ševčík, L.; Uhrin, D.; Voznak, M. PFPM: Discovering periodic frequent patterns with novel periodicity measures. In Proceedings of the 2nd Czech-China Scientific Conference 2016, Ostrava, Czech Republic, 7 June 2016; IntechOpen: London, UK, 2017.
  20. Kiran, R.U.; Venkatesh, J.; Fournier-Viger, P.; Toyoda, M.; Reddy, P.K.; Kitsuregawa, M. Discovering periodic patterns in non-uniform temporal databases. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Chengdu, China, 16–19 May 2022; Springer: Berlin/Heidelberg, Germany, 2017; pp. 604–617.
  21. Fournier-Viger, P.; Yang, P.; Lin, J.C.W.; Kiran, R.U. Discovering stable periodic-frequent patterns in transactional data. In Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Kitakyushu, Japan, 19–22 July; Springer: Berlin/Heidelberg, Germany, 2019; pp. 230–244.
  22. Fournier-Viger, P.; Wang, Y.; Yang, P.; Lin, J.C.W.; Yun, U.; Kiran, R.U. Tspin: Mining top-k stable periodic patterns. Appl. Intell. 2022, 52, 6917–6938.
  23. Gan, W.; Lin, J.C.W.; Fournier-Viger, P.; Chao, H.C.; Hong, T.P.; Fujita, H. A survey of incremental high-utility itemset mining. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1242.
  24. Fournier-Viger, P.; Wu, C.W.; Zida, S.; Tseng, V.S. FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning. In Proceedings of the International Symposium on Methodologies for Intelligent Systems, Limassol, Cyprus, 29–31 October 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 83–92.
  25. Lin, C.W.; Hong, T.P.; Lu, W.H. An effective tree structure for mining high utility itemsets. Expert Syst. Appl. 2011, 38, 7419–7424.
  26. Lin, Y.C.; Wu, C.W.; Tseng, V.S. Mining high utility itemsets in big data. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Chengdu, China, 16–19 May 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 649–661.
  27. Liu, M.; Qu, J. Mining high utility itemsets without candidate generation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, Maui, HI, USA, 29 October–2 November 2012; pp. 55–64.
  28. Yun, U.; Ryang, H.; Ryu, K.H. High utility itemset mining with techniques for reducing overestimated utilities and pruning candidates. Expert Syst. Appl. 2014, 41, 3861–3878.
  29. Zida, S.; Fournier-Viger, P.; Lin, J.C.W.; Wu, C.W.; Tseng, V.S. EFIM: A highly efficient algorithm for high-utility itemset mining. In Proceedings of the Mexican International Conference on Artificial Intelligence, Mexico City, Mexico, 25–30 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 530–546.
  30. Amphawan, K.; Lenca, P.; Surarerks, A. Mining top-k periodic-frequent pattern from transactional databases without support threshold. In Proceedings of the International Conference on Advances in Information Technology, Bangkok, Thailand, 1–5 December 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 18–29.
  31. Kiran, R.U.; Kitsuregawa, M.; Reddy, P.K. Efficient discovery of periodic-frequent patterns in very large databases. J. Syst. Softw. 2016, 112, 110–121.
  32. Surana, A.; Kiran, R.U.; Reddy, P.K. An efficient approach to mine periodic-frequent patterns in transactional databases. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Shenzhen, China, 24–27 May 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 254–266.
  33. Tanbeer, S.K.; Ahmed, C.F.; Jeong, B.S.; Lee, Y.K. Discovering periodic-frequent patterns in transactional databases. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkok, Thailand, 27–30 April 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 242–253.
  34. Han, J.; Dong, G.; Yin, Y. Efficient mining of partial periodic patterns in time series database. In Proceedings of the 15th International Conference on Data Engineering (Cat. No. 99CB36337), Sydney, NSW, Australia, 23–26 March 1999; pp. 106–115.
  35. Yu, X.; Yu, H. An asynchronous periodic sequential patterns mining algorithm with multiple minimum item supports. In Proceedings of the 2014 Ninth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, Guangzhou, China, 8–11 November 2014; pp. 274–281.
  36. Fournier-Viger, P.; Lin, J.C.W.; Duong, Q.H.; Dam, T.L. PHM: Mining periodic high-utility itemsets. In Proceedings of the Industrial Conference on Data Mining, New York, NY, USA, 13–17 July 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 64–79.
  37. Lin, J.C.W.; Zhang, J.; Fournier-Viger, P. High-utility sequential pattern mining with multiple minimum utility thresholds. In Proceedings of the Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data, Guangzhou, China, 23–25 August 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 215–229.
  38. Lin, J.C.W.; Zhang, J.; Fournier-Viger, P.; Hong, T.P.; Zhang, J. A two-phase approach to mine short-period high-utility itemsets in transactional databases. Adv. Eng. Inform. 2017, 33, 29–43.
  39. Ayres, J.; Flannick, J.; Gehrke, J.; Yiu, T. Sequential pattern mining using a bitmap representation. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, 23–26 July 2002; pp. 429–435.
  40. Fournier-Viger, P.; Lin, J.C.W.; Gomariz, A.; Gueniche, T.; Soltani, A.; Deng, Z.; Lam, H.T. The SPMF open-source data mining library version 2. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Riva del Garda, Italy, 19–23 September 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 36–40.
  41. Dong, X.; Gong, Y.; Cao, L. e-RNSP: An efficient method for mining repetition negative sequential patterns. IEEE Trans. Cybern. 2018, 50, 2084–2096.
Figure 1. Framework of the SPHUSPM algorithm.
Figure 2. Execution times for different parameter values (maxPer maxLa (PL)).
Figure 3. Pattern count for different parameter values (maxPer maxLa (PL)).
Figure 4. Candidates for different parameter values (maxPer maxLa (PL)).
Table 1. Symbols.

| Symbol | Description |
|---|---|
| $i$ | item |
| $X$ | q-itemset |
| $t$ | sequence |
| $s$ | q-sequence |
| $S$, $D$ | a quantitative sequence database |
| $sid$ | the identifier of a sequence |
| $q(i, s)$ | the quantity of a q-item $i$ in a q-sequence $s$ |
| $p(i_k)$ | the unit profit or importance (external utility) of $i_k$ |
| | the lexicographical order |
| $u(i, q)$ | the utility of a q-item $(i, q)$ in a q-sequence $s$ |
| $u(X)$ | the utility of a q-itemset $X$ in a q-sequence $s$ |
| $u(s)$ | the utility of a q-sequence $s$ |
| $t \sim s$ | sequence $t$ matches q-sequence $s$ |
| $v(t, s)$ | the sequence utility of a sequence $t$ in a q-sequence $s$ |
| $v(t)$ | the utility of $t$ in a q-sequence database $S$ |
| $u_{max}(t, s)$ | the maximum utility of a sequence $t$ in a q-sequence $s$ |
| $u_{max}(t)$ | the maximum utility of a sequence $t$ in a q-sequence database $S$ |
| $\langle s - t \rangle_{rest}$ | the extension of a sequence $t$ in a q-sequence $s$ |
| $I(t)_{rest}$ | the set of extension items of a sequence $t$ in a quantitative sequential database $D$ |
| $ru(t, s)$ | the remaining utility of a sequence $t$ in a q-sequence $s$ |
| $S(t)$ | the set of q-sequences containing the sequence $t$ |
| $pe(s_\alpha, s_\beta)$ | the period of two consecutive q-sequences $s_\alpha$ and $s_\beta$ |
| $pes(t)$ | the periods of the sequence $t$ |
| $la(t)$ | the lability of the sequence $t$ |
| $\langle t\ i_j \rangle$ | the concatenation of $t$ with $i_j$ |

Table 2. A quantitative sequence database.

| SID | Q-Sequence |
|---|---|
| $S_1$ | <[(a,1)(b,1)(e,3)], [(c,3)(d,2)(g,3)], [(b,2)(e,1)], [(d,3)]> |
| $S_2$ | <[(a,3)(b,1)(c,3)(f,2)], [(a,5)(c,2)(g,5)], [(b,3)(d,2)(e,2)]> |
| $S_3$ | <[(b,1)(c,1)(e,2)(g,5)], [(a,3)(b,2)(e,4)(f,2)], [(b,2)(c,1)(e,2)]> |
| $S_4$ | <[(b,2)(c,3)], [(a,5)(e,1)], [(b,4)(d,3)(e,5)]> |
| $S_5$ | <[(a,4)(c,3)], [(a,2)(b,5)(c,2)(d,4)(e,3)]> |
| $S_6$ | <[(f,4)], [(a,5)(b,3)], [(a,3)(d,4)]> |

Table 3. A utility table.

| Item | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| Profit | 1 | 3 | 4 | 2 | 1 | 6 | 2 |

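
As a quick check on how Tables 2 and 3 combine, the following minimal sketch computes the utility $u(s)$ of each q-sequence as the sum of quantity times unit profit over all of its q-items. The names `PROFIT`, `DB`, and `sequence_utility` are illustrative choices for this example and are not taken from the authors' implementation.

```python
# Unit profits from Table 3 (external utilities).
PROFIT = {"a": 1, "b": 3, "c": 4, "d": 2, "e": 1, "f": 6, "g": 2}

# Quantitative sequence database from Table 2:
# each q-sequence is a list of q-itemsets, each q-itemset a list of (item, quantity).
DB = {
    1: [[("a", 1), ("b", 1), ("e", 3)], [("c", 3), ("d", 2), ("g", 3)],
        [("b", 2), ("e", 1)], [("d", 3)]],
    2: [[("a", 3), ("b", 1), ("c", 3), ("f", 2)], [("a", 5), ("c", 2), ("g", 5)],
        [("b", 3), ("d", 2), ("e", 2)]],
    3: [[("b", 1), ("c", 1), ("e", 2), ("g", 5)],
        [("a", 3), ("b", 2), ("e", 4), ("f", 2)],
        [("b", 2), ("c", 1), ("e", 2)]],
    4: [[("b", 2), ("c", 3)], [("a", 5), ("e", 1)],
        [("b", 4), ("d", 3), ("e", 5)]],
    5: [[("a", 4), ("c", 3)],
        [("a", 2), ("b", 5), ("c", 2), ("d", 4), ("e", 3)]],
    6: [[("f", 4)], [("a", 5), ("b", 3)], [("a", 3), ("d", 4)]],
}


def sequence_utility(qseq, profit):
    """Return u(s): the sum of quantity * unit profit over every q-item in s."""
    return sum(qty * profit[item] for itemset in qseq for item, qty in itemset)


utilities = {sid: sequence_utility(qseq, PROFIT) for sid, qseq in DB.items()}
print(utilities)
```

For $S_1$ this gives a utility of 42, which agrees with Table 5: the first entry there has utility 1 and remaining utility 41.
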
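Table 4. The occurrences of all items.

| Sequence ID | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 3 | 4 | 4 | 4 | 5 | 5 | 6 | 6 | 6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Transaction ID | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 2 | 1 | 2 | 3 |
| Item a | a | | | | a | a | | | a | | | a | | a | a | | a | a |
| Item b | b | | b | | b | | b | b | b | b | b | | b | | b | | b | |
| Item c | | c | | | c | c | | c | | c | c | | | c | c | | | |
| Item d | | d | | d | | | d | | | | | | d | | d | | | d |
| Item e | e | | e | | | | e | e | e | e | | e | e | | e | | | |
| Item f | | | | | f | | | | f | | | | | | | f | | |
| Item g | | g | | | | g | | g | | | | | | | | | | |
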
Table 5. The utility-linked (UL)-list structure of $s_1$.

| Component | Content |
|---|---|
| UP information of $s_1$ | [(a, 1, 41, )(b, 3, 38, 7)(e, 3, 35, 8)], [(c, 12, 23, )(d, 4, 19, 9)(g, 6, 13, )], [(b, 6, 7, )(e, 1, 6, )], [(d, 6, 0, )] |
| Header table of $s_1$ | (a, 1) (b, 2) (c, 4) (d, 5) (e, 3) (g, 6) |

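
The entries of Table 5 can be reproduced from Tables 2 and 3. The sketch below is one possible reading of the table, not the authors' code: each entry is taken to be (item, utility, remaining utility, position of the next occurrence of the same item), and the header table maps each item to the position of its first occurrence. The function name `build_ul_list`, the 1-based flat positions, and the use of `None` for an empty fourth field are assumptions made for illustration. Itemset boundaries are dropped here for brevity.

```python
# Unit profits from Table 3 and q-sequence s1 from Table 2.
PROFIT = {"a": 1, "b": 3, "c": 4, "d": 2, "e": 1, "f": 6, "g": 2}
S1 = [[("a", 1), ("b", 1), ("e", 3)], [("c", 3), ("d", 2), ("g", 3)],
      [("b", 2), ("e", 1)], [("d", 3)]]


def build_ul_list(qseq, profit):
    """Build (entries, header) for one q-sequence.

    entries: tuples (item, utility, remaining_utility, next_position),
             where next_position is None if the item does not occur again.
    header:  dict mapping each item to the position of its first occurrence.
    Positions are 1-based over the flattened q-sequence (an assumption).
    """
    flat = [(item, qty * profit[item]) for itemset in qseq for item, qty in itemset]
    remaining = sum(util for _, util in flat)
    entries, header, last_pos = [], {}, {}
    for pos, (item, util) in enumerate(flat, start=1):
        remaining -= util  # utility of everything after this q-item
        entries.append([item, util, remaining, None])
        if item in last_pos:
            # Link the previous occurrence of this item to the current position.
            entries[last_pos[item] - 1][3] = pos
        else:
            header[item] = pos
        last_pos[item] = pos
    return [tuple(e) for e in entries], header


entries, header = build_ul_list(S1, PROFIT)
for entry in entries:
    print(entry)
print(header)
```

Under these assumptions the second entry is `("b", 3, 38, 7)` and the header maps `a` to 1, `b` to 2, `e` to 3, `c` to 4, `d` to 5, and `g` to 6, matching Table 5.
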
Table 6. The period-utility-linked (PUL)-list structure of <(ab)>.

| Component | Content |
|---|---|
| UP information of $s_1$ | [(e, 3, 35, 6)], [(c, 12, 23, )(d, 4, 19, 7)(g, 6, 13, )], [(b, 6, 7, )(e, 1, 6, )], [(d, 6, 0, )] |
| Header table of $s_1$ | (b, 5) (c, 2) (d, 3) (e, 6) (g, 4) |
| UP information of $s_2$ | [(c, 12, 50, 4)(f, 12, 38, )], [(a, 5, 33, )(c, 8, 25, )(g, 10, 15, )], [(b, 9, 6, )(d, 4, 2, )(e, 2, 0, )] |
| Header table of $s_2$ | (a, 3) (b, 6) (c, 1) (d, 7) (e, 8) (f, 2) (g, 5) |
| UP information of $s_3$ | [(e, 4, 24, 5)(f, 12, 12, )], [(b, 6, 6, )(c, 4, 2, )(e, 2, 0, )] |
| Header table of $s_3$ | (b, 3) (c, 4) (e, 1) (f, 2) |
| UP information of $s_5$ | [(c, 8, 11, )(d, 8, 3, )(e, 3, 0, )] |
| Header table of $s_5$ | (c, 1) (d, 2) (e, 3) |
| UP information of $s_6$ | [(a, 3, 8, )(d, 8, 0, )] |
| Header table of $s_6$ | (a, 3) (d, 2) |
| Periodic information of <(ab)> | <1, 2, 3, 5, 6> |

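
The periodic information in Table 6 lists the identifiers of the q-sequences that contain <(ab)>. The sketch below shows how periods and lability values could be derived from such a list. It assumes the conventions commonly used in the stable periodic pattern literature, which may differ in detail from the definitions in the main text: periods are the gaps between consecutive identifiers, padded with 0 at the start and the database size at the end, and lability accumulates the excess of each period over maxPer, floored at zero. The function names and the absolute threshold `max_per=1` are hypothetical.

```python
def periods(sids, db_size):
    """Gaps between consecutive occurrences, padded with 0 and db_size (assumed)."""
    points = [0] + sorted(sids) + [db_size]
    return [b - a for a, b in zip(points, points[1:])]


def labilities(pes, max_per):
    """Running lability: la_i = max(0, la_{i-1} + pe_i - max_per) (assumed form)."""
    la, out = 0, []
    for pe in pes:
        la = max(0, la + pe - max_per)
        out.append(la)
    return out


# Periodic information of <(ab)> from Table 6; the database in Table 2 has 6 q-sequences.
sids = [1, 2, 3, 5, 6]
pes = periods(sids, db_size=6)
las = labilities(pes, max_per=1)  # max_per=1 is an illustrative value only
print(pes, las, max(las))
```

With these assumed definitions, a pattern would be treated as stable when its largest lability value does not exceed the maxLa threshold.
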
Table 7. Parameters of the datasets.

| Parameter | Description |
|---|---|
| $\lvert D \rvert$ | Number of sequences |
| $\lvert I \rvert$ | Number of distinct items |
| C | Average number of itemsets per sequence |
| T | Average number of items per itemset |
| MaxLen | Maximum number of items per sequence |

Table 8. Characteristics of the datasets.

| Dataset | $\lvert D \rvert$ | $\lvert I \rvert$ | C | T | MaxLen |
|---|---|---|---|---|---|
| Sign | 730 | 267 | 52.0 | 1 | 94 |
| Bible | 36,369 | 13,905 | 21.6 | 1 | 100 |
| Kosarak10k | 10,000 | 10,094 | 8.14 | 1 | 608 |
| Leviathan | 5834 | 9025 | 33.8 | 1 | 100 |
| yoochoose-buys | 234,300 | 16,004 | 1.13 | 1.97 | 21 |
| MSNBC | 31,790 | 423,776 | 13.33 | 1 | 86 |

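Table 9. Memory usage for different parameter values.

| Sign maxPer | Sign maxLa | Sign minutil | Sign max memory | Bible maxPer | Bible maxLa | Bible minutil | Bible max memory |
|---|---|---|---|---|---|---|---|
| 1% | 5% | 1.2% | 306.42 | 0.5% | 0.5% | 0.5% | 1106.55 |
| 1% | 5% | 1.7% | 306.32 | 0.5% | 0.5% | 1% | 1112.40 |
| 1% | 10% | 1.2% | 307.49 | 0.5% | 1% | 0.5% | 1118.63 |
| 1% | 10% | 1.7% | 306.68 | 0.5% | 1% | 1% | 1125.90 |
| 2% | 5% | 1.2% | 307.60 | 1% | 0.5% | 0.5% | 1113.62 |
| 2% | 5% | 1.7% | 306.81 | 1% | 0.5% | 1% | 1133.95 |
| 2% | 10% | 1.2% | 310.12 | 1% | 1% | 0.5% | 1140.22 |
| 2% | 10% | 1.7% | 307.75 | 1% | 1% | 1% | 1134.50 |

| Kosarak10k maxPer | Kosarak10k maxLa | Kosarak10k minutil | Kosarak10k max memory | Leviathan maxPer | Leviathan maxLa | Leviathan minutil | Leviathan max memory |
|---|---|---|---|---|---|---|---|
| 0.5% | 0.5% | 1.69% | 238.70 | 0.5% | 0.5% | 1% | 666.37 |
| 0.5% | 0.5% | 1.74% | 236.87 | 0.5% | 0.5% | 1.25% | 648.24 |
| 0.5% | 1% | 1.69% | 248.66 | 0.5% | 1% | 1% | 674.32 |
| 0.5% | 1% | 1.74% | 242.95 | 0.5% | 1% | 1.25% | 662.90 |
| 1% | 0.5% | 1.69% | 248.30 | 1% | 0.5% | 1% | 676.01 |
| 1% | 0.5% | 1.74% | 247.64 | 1% | 0.5% | 1.25% | 666.23 |
| 1% | 1% | 1.69% | 250.92 | 1% | 1% | 1% | 681.99 |
| 1% | 1% | 1.74% | 248.94 | 1% | 1% | 1.25% | 678.90 |

| yoochoose-buys maxPer | yoochoose-buys maxLa | yoochoose-buys minutil | yoochoose-buys max memory | MSNBC maxPer | MSNBC maxLa | MSNBC minutil | MSNBC max memory |
|---|---|---|---|---|---|---|---|
| 25% | 25% | 0.024% | 549.85 | 0.5% | 0.5% | 1% | 636.57 |
| 25% | 25% | 0.034% | 536.48 | 0.5% | 0.5% | 2% | 620.43 |
| 25% | 30% | 0.024% | 579.85 | 0.5% | 1% | 1% | 640.75 |
| 25% | 30% | 0.034% | 560.78 | 0.5% | 1% | 2% | 626.33 |
| 30% | 25% | 0.024% | 582.66 | 1% | 0.5% | 1% | 641.33 |
| 30% | 25% | 0.034% | 567.38 | 1% | 0.5% | 2% | 634.43 |
| 30% | 30% | 0.024% | 586.02 | 1% | 1% | 1% | 651.61 |
| 30% | 30% | 0.034% | 561.23 | 1% | 1% | 2% | 637.15 |
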
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Xie, S.; Zhao, L. An Efficient Algorithm for Mining Stable Periodic High-Utility Sequential Patterns. Symmetry 2022, 14, 2032. https://doi.org/10.3390/sym14102032
