1. Introduction
Data mining is an important technology in the field of computer science. It is the nontrivial process of revealing hidden, previously unknown, and potentially valuable information from the large amounts of data in a database. In recent years, utility-driven mining and learning from data has received growing attention from researchers due to its high potential in many applications, covering finance, biomedicine, manufacturing, e-commerce, social media, etc. Current research topics in utility-driven mining focus primarily on discovering patterns of high value (e.g., high profit) in large databases or on analyzing the important factors (e.g., economic factors) in the data mining process [1]. High utility sequential pattern (HUSP) mining is an important task in utility-driven mining [2,3,4,5,6,7,8]. It focuses on extracting subsequences with high utility (importance) from quantitative sequential databases. Current HUSP mining algorithms, however, only consider occurring events and do not take non-occurring events into account, which results in the loss of much useful information [9,10,11]. Thus, high utility negative sequential pattern (HUNSP) mining has been proposed to address this issue by considering both occurring and non-occurring events. HUNSP can provide valuable information that HUSP cannot, because a non-occurring event may influence the event that follows it. For example, given HUSP $<abc>$ and HUNSP $<a\neg bd>$, $<abc>$ indicates that a occurs first, then b, and then c, while $<a\neg bd>$ indicates that a occurs first and, if b does not occur, then d (not c) will occur. That is, whether b occurs or not may determine whether c or d occurs. Therefore, HUNSP is very important, yet few studies on HUNSP have been reported so far [12,13]. The high utility negative sequential pattern mining (HUNSPM) algorithm solved the key problem of how to calculate the utility of negative sequences by setting the utility of negative elements (not items) to 0 and choosing the maximum utility as the sequence's utility [14]. Nevertheless, both HUSP and HUNSP have a limitation: they cannot indicate the probability that some subsequences will occur after other subsequences occur (or do not occur). For example, assume HUSP ${p}_{1}=<abcX>$ and HUNSP ${p}_{2}=<a\neg bdY>$, where a, b, c, and d denote four symptoms, i.e., headache, sore throat, fever, and cervical pain, and X and Y denote two diseases, i.e., cold and cervical spondylosis. ${p}_{1}$ shows that patients who have a headache, then a sore throat, and then a fever are likely to have a cold, whereas ${p}_{2}$ indicates that patients who have a headache but not a sore throat, and then cervical pain, probably have cervical spondylosis. Although ${p}_{1}$ and ${p}_{2}$ are apparently useful, we cannot know from them the exact probability that patients will have disease X (or Y) after they show symptom a, then b (or not b), and then c (or d).
To address this issue, high utility sequential rule (HUSR) mining has been proposed within HUSP mining. A HUSR such as $<abc>\Rightarrow <X>$ can tell us precisely the probability that patients will have a cold after they show a headache, then a sore throat, and then a fever. Unfortunately, research studies on HUSR mining are few. The HUSRM algorithm was proposed for HUSR mining [15]. HUSRM first scans the database to build all sequential rules for which the sizes of the antecedent and consequent are both one. Then, it recursively performs expansions starting from those rules to generate sequential rules of a longer size based on the support-confidence framework and a minimum utility threshold. However, the rules mined by HUSRM only guarantee that the items in the consequent occur after the items in the antecedent; the items within the antecedent and within the consequent are internally unordered (e.g., a, b, and c are unordered in the HUSR $<abc>\Rightarrow <X>$). In addition, HUSRM imposes a very strict constraint on datasets, namely that an item cannot occur more than once in a sequence (e.g., $<aba>$ is not allowed because a appears twice in the sequence), similar to association rule mining. In fact, this constraint is too strict for many real-life applications. Moreover, HUSRM does not take non-occurring events into account, which can lead to the loss of much valuable information.
In order to mine high utility sequential rules that consider non-occurring events, high utility negative sequential rules (HUNSR) should be proposed. With the HUNSR $<a\neg bd>\Rightarrow <Y>$, we can clearly know the probability that patients will have cervical spondylosis after they show a headache, but not a sore throat, and then cervical pain. Unfortunately, we have not found any existing research on HUNSR mining so far. In fact, HUNSR mining is very challenging due to the following three intrinsic complexities.
Firstly, it is very difficult to define the HUNSR mining problem because of the hidden nature of non-occurring events in HUNSR. For example, what is a valid HUNSR? There is no unified measure to evaluate the usefulness of rules. The traditional support-confidence framework is not applicable to mining HUNSR because it does not involve the utility measure [16,17,18]. Furthermore, the utility-confidence framework in high utility association rule (HUAR) mining is not applicable either, because it does not involve the ordinal nature of sequential patterns [19]. So, it is very important to formalize the problem properly and comprehensively.
Secondly, it is very difficult to calculate the antecedent's local utility value in a high utility negative sequential rule candidate (HUNSRC), which is a core step in calculating the utility-confidence of a HUNSRC. For simplicity, we take HUSP $<abad>$ and HUSR $<ab>\Rightarrow <ad>$ as an example to illustrate how to obtain the local utility value of $<ab>$ in $<abad>$. Different from HUAR mining, where all items appear only once in a rule, the ordinal nature of sequential patterns means that $<ab>$ and $<abad>$ may have multiple matches in a q-sequence (such as ${s}_{1}=<abcabdcd>$), which makes the problem quite complicated. To calculate the local utility value of $<ab>$ in $<abad>$, we first obtain the utility values of all the matches of $<abad>$ in ${s}_{1}$, choose the maximum one, and record the transaction IDs (TIDs) of its items (such as $\left\{1,2,4,6\right\}$); we then find the TIDs of $<ab>$ within this match, i.e., $\left\{1,2\right\}$. Next, we find the corresponding utility of this $<ab>$ (with TIDs $\left\{1,2\right\}$) in ${s}_{1}$. Using the above method, we find and sum the utility values of $<ab>$ in each q-sequence containing $<abad>$ in the database; the sum is the local utility value of $<ab>$ in $<abad>$. However, it is much more difficult to calculate the antecedent's local utility value (such as that of $<a\neg b>$) in a HUNSR (such as $<a\neg b>\Rightarrow <a\neg d>$) than in a HUSR, for the following two reasons. One is that it is rather difficult to determine the TID of a negative item in a q-sequence [20]. The other is that the sequence ID (SID), TID, and utility information necessary to calculate the local utility value is not saved by current algorithms, as these algorithms do not need to calculate the local utility value [12,13,14].
Finally, the antecedent's utility may not be available; that is, it may not have been saved by current algorithms because it may be less than the minimum utility threshold. The antecedent's utility is, however, also necessary for calculating the utility-confidence of a HUNSRC. Therefore, we must modify the existing algorithms to save the related information of the antecedent and recalculate its utility.
To address the above intrinsic complexities, this paper proposes a comprehensive algorithm called eHUNSR. The main ideas of eHUNSR are as follows. First, we formalize the HUNSR problem by defining a series of important concepts, including the local utility value, the utility-confidence measure, and so on. Second, a novel data structure called the SLU-list (sequence location and utility list) is proposed to record all the information required to calculate the antecedent's local utility value, including the SID, TID, and utility of the intermediate subsequences generated during the mining of the HUSP that corresponds to the HUNSRC. Third, in order to efficiently calculate the local utility value and utility of the antecedent, we convert the HUNSR calculation problem into its corresponding HUSR calculation problem, which simplifies the calculation based on the high utility negative sequential patterns discovered by the HUNSPM algorithm [14]. In addition, we propose an efficient method to generate HUNSRCs based on the HUNSPs mined by the HUNSPM algorithm, and a pruning strategy to prune the large proportion of meaningless HUNSRCs.
Our main contributions are summarized as follows.
 (1)
We formalize the problem of HUNSR mining by proposing a series of concepts;
 (2)
We propose a novel data structure to store the related information of HUNSRC and a method to efficiently calculate the local utility value and utility of HUNSRC’s antecedent;
 (3)
We propose an efficient method to generate HUNSRC based on HUNSP and a pruning strategy to prune meaningless HUNSRC;
 (4)
Based on the above, we propose an efficient algorithm named eHUNSR to mine HUNSR. To the best of our knowledge, this is the first study to mine HUNSR. The experimental results on two real-life and 12 synthetic datasets show that eHUNSR is very efficient.
The rest of the paper is structured as follows. In Section 2, we briefly review the existing works in the literature. In Section 3, we provide some basic preliminaries. Section 4 introduces our proposed eHUNSR algorithm. Section 5 presents the experimental results. Finally, the paper ends with concluding remarks in Section 6.
3. Preliminaries
In this section, we introduce some basic preliminaries.
Let $I=\left\{{i}_{1},{i}_{2},\dots ,{i}_{n}\right\}$ be a set of distinct items. Each item ${i}_{k}\in I(1\le k\le n)$ is associated with a positive number $p\left({i}_{k}\right)$, called its quality or external utility. A q-item $\left(i,q\right)$ is a pair in which $i\in I$ is an item and q is a positive number representing the quantity or internal utility of i, e.g., the purchased quantity of i. A q-itemset, which is a set of q-items $\left({i}_{k},{q}_{k}\right)$ for $1\le k\le n$, is denoted and defined as $l=\left[\left({i}_{1},{q}_{1}\right)\left({i}_{2},{q}_{2}\right)\dots \left({i}_{n},{q}_{n}\right)\right]$. If a q-itemset contains only one q-item, the brackets can be omitted for brevity. A q-sequence is defined as $s=<{l}_{1},{l}_{2},\dots ,{l}_{m}>$, where ${l}_{k}(1\le k\le m)$ is a q-itemset. A negative q-sequence, denoted as $s=<\neg {l}_{1},{l}_{2},\dots ,{l}_{x}>$, contains at least one negative q-itemset, where ${l}_{k}\left(1\le k\le x\right)$ is called a positive q-itemset and $\neg {l}_{k}\left(1\le k\le x\right)$ is called a negative q-itemset. A q-sequence database D is composed of tuples of the form $<sid,s>$, where s is a q-sequence and $sid$ is the unique identifier of s. $Size\left(s\right)$ represents the number of itemsets (positive or negative) in s. $Length\left(s\right)$ represents the number of items (positive or negative) in s.
We use the examples in Table 1 and Table 2 to illustrate these concepts. In q-sequence ${s}_{1}$ ($sid$ = 1), (a, 2), (b, 2), (f, 3), (b, 3), and (d, 3) are q-items, where 2, 2, 3, 3, and 3 represent the internal utilities of a, b, f, b, and d, respectively. [(b, 2)(f, 3)] is a q-itemset consisting of two q-items. According to the utility table given in Table 1, the external utilities of a, b, f, and d are 2, 4, 1, and 5, respectively. For convenience, we use the prefix "q" to name objects associated with quantities; that is, "q-item", "q-itemset", and "q-sequence" all involve quantities. We denote the q-sequence with $sid$ = 1 in Table 2 as ${s}_{1}$, and the other q-sequences are numbered accordingly.
Definition 1. Given a q-sequence $s=<\left({s}_{1},{q}_{1}\right)\left({s}_{2},{q}_{2}\right)\dots \left({s}_{n},{q}_{n}\right)>$ and a sequence $t=<{t}_{1}{t}_{2}\dots {t}_{m}>$, s matches t if $n=m$ and ${s}_{k}={t}_{k}$ for $1\le k\le n$, denoted as $t\sim s$.
For example, $<a>$ has the match $<(a,2)>$ in ${s}_{1}$, and $<b>$ has two matches, $<(b,2)>$ and $<(b,3)>$, in ${s}_{1}$. A sequence may have multiple matches in a q-sequence.
Definition 2. The utility of a q-item $(i,q)$ is denoted as $u(i,q)$ and is defined as: $u(i,q)=p\left(i\right)\times q$. For example, $u(b,2)=4\times 2=8$.
Definition 3. The utility of a q-itemset $l=\left[\left({i}_{1},{q}_{1}\right)\left({i}_{2},{q}_{2}\right)\dots \left({i}_{n},{q}_{n}\right)\right]$ is denoted as $u\left(l\right)$ and is defined as: $u\left(l\right)={\sum }_{k=1}^{n}u\left({i}_{k},{q}_{k}\right)$. For example, $u\left(\left[(b,2)(f,3)\right]\right)=4\times 2+1\times 3=11$.
Definition 4. The utility of a q-sequence $s=<{l}_{1},{l}_{2},\dots ,{l}_{n}>$ is denoted as $u\left(s\right)$ and is defined as the sum of the utilities of its q-itemsets ${l}_{k}$: $u\left(s\right)={\sum }_{k=1}^{n}u\left({l}_{k}\right)$. For example, $u(<\left(b,1\right)\left(f,6\right)\left[\left(d,2\right)\left(e,3\right)\right]>)=4\times 1+1\times 6+5\times 2+3\times 3=29$.
Definition 5. The utility of a sequence t in a q-sequence s is denoted as $u(t,s)$ and is defined as the set of utilities of all matches of t in s: $u(t,s)=\left\{u\left({s}^{\prime }\right)\mid t\sim {s}^{\prime }\wedge {s}^{\prime }\subseteq s\right\}$. The utility of t in a q-sequence database D is denoted as $u\left(t\right)$ and is defined as: $u\left(t\right)=\left\{u(t,s)\mid s\in D\right\}$. For example, the utility of sequence $<ab>$ in q-sequence ${s}_{1}$ is $u(<ab>,{s}_{1})=\{u(<(a,2)(b,2)>,{s}_{1}),u(<(a,2)(b,3)>,{s}_{1})\}=\left\{12,16\right\}$. The utility of sequence $<ab>$ in D is $u(<ab>)=\{u(<ab>,{s}_{1}),u(<ab>,{s}_{2}),u(<ab>,{s}_{5})\}=\left\{\left\{12,16\right\},\left\{18\right\},\left\{14,10,12\right\}\right\}$.
Definition 6. We choose the maximum utility as the sequence's utility. The maximum utility of sequence t is denoted as ${u}_{max}\left(t\right)$ and is defined as: ${u}_{max}\left(t\right)={\sum }_{s\in D}max\left(u(t,s)\right)$. According to Definition 6, the maximum utility of sequence $<ab>$ in D is ${u}_{max}(<ab>)=16+18+14=48$.
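To make Definitions 2–6 concrete, the following is a minimal Python sketch (illustrative only; the authors' implementation is in Java). The external utilities follow the Table 1 values quoted above (a = 2, b = 4, f = 1, d = 5, e = 3), and the itemset layout of ${s}_{1}$ is inferred from the worked examples; both are assumptions where the tables are not reproduced here.

```python
# Minimal sketch of Definitions 2-6 (illustrative, not the authors' code).
# External utilities follow the Table 1 values quoted in the text; the
# itemset layout of s1 is inferred from the worked examples.

external_utility = {"a": 2, "b": 4, "f": 1, "d": 5, "e": 3}

s1 = [                      # q-sequence s1: one q-itemset per TID
    [("a", 2)],             # TID 1
    [("b", 2), ("f", 3)],   # TID 2
    [("b", 3)],             # TID 3
    [("d", 3)],             # TID 4 (position assumed)
]

def u_qitem(item, qty):
    """Definition 2: utility of a q-item is p(i) * q."""
    return external_utility[item] * qty

def u_qitemset(itemset):
    """Definition 3: sum of the utilities of the q-items."""
    return sum(u_qitem(i, q) for i, q in itemset)

def u_qsequence(qseq):
    """Definition 4: sum of the utilities of the q-itemsets."""
    return sum(u_qitemset(l) for l in qseq)

def match_utilities(pattern, qseq):
    """Definition 5: utilities of all matches of a positive sequence.
    pattern is a list of item sets, e.g. [{"a"}, {"b"}] for <ab>; each
    pattern itemset must be contained in a strictly later q-itemset."""
    out = []
    def rec(k, start, util):
        if k == len(pattern):
            out.append(util)
            return
        for t in range(start, len(qseq)):
            items = dict(qseq[t])
            if pattern[k] <= items.keys():
                u = sum(u_qitem(i, items[i]) for i in pattern[k])
                rec(k + 1, t + 1, util + u)
    rec(0, 0, 0)
    return out

print(u_qitem("b", 2))                                # 8
print(u_qitemset([("b", 2), ("f", 3)]))               # 11
print(sorted(match_utilities([{"a"}, {"b"}], s1)))    # [12, 16]
print(max(match_utilities([{"a"}, {"b"}], s1)))       # 16: s1's term in u_max
```

Definition 6 then sums, over all q-sequences of the database, the maximum of these per-sequence match utilities.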
4. The eHUNSR Algorithm
In this section, we present the framework and working mechanism of the eHUNSR algorithm, which is illustrated in Figure 1. More specifically, we first introduce the framework of eHUNSR, and then present the utility-confidence concepts, the HUNSRC generation method, the pruning strategy, the data structure that stores the related information of HUNSRC, and the calculation of the utility-confidence of HUNSRC.
4.1. The Framework of the eHUNSR Algorithm
Given a q-sequence database, eHUNSR captures HUNSRs via the following four steps.
 Step 1.
Mine all HUNSPs from the q-sequence database using a traditional HUNSP mining algorithm, i.e., the HUNSPM algorithm;
 Step 2.
Use the HUNSRC generation method to obtain all HUNSRCs based on the HUNSPs;
 Step 3.
Remove unpromising HUNSRCs and calculate the utility-confidence of the promising HUNSRCs;
 Step 4.
Find all HUNSRs satisfying the user-specified minimum utility-confidence threshold.
4.2. The Utility-Confidence Concepts in HUNSR Mining
In this section, a series of definitions is proposed to construct the utility-confidence framework in HUNSR mining. Calculating the local utility value is the core step in calculating the utility-confidence of a rule; therefore, we propose Definitions 7–11 to illustrate the local utility value in HUNSR mining as follows.
Definition 7. Let q-sequences ${t}_{1}=<{l}_{1}{l}_{2}\dots {l}_{n}>$ and ${t}_{2}=<{l}_{1}^{\prime }{l}_{2}^{\prime }\dots {l}_{{n}^{\prime }}^{\prime }>$ be subsequences of a q-sequence s, where ${t}_{1}\subseteq {t}_{2}\subseteq s$ and $s\in D$. The local utility value of ${t}_{1}$ in ${t}_{2}$ is the sum of the utility values of the q-itemsets ${l}_{k}$ $({l}_{k}\in {t}_{1}\wedge {t}_{2},1\le k\le n)$, which is denoted as $luv\left({t}_{1},{t}_{2},s\right)$ and is defined as: $luv\left({t}_{1},{t}_{2},s\right)={\sum }_{{l}_{k}\in {t}_{1}}u\left({l}_{k}\right)$. For example, in q-sequence ${s}_{1}$, for q-subsequences ${t}_{1}=<\left(a,2\right)(b,2)>$ and ${t}_{2}=<\left(a,2\right)\left[(b,2)(f,3)\right]>$, $luv\left({t}_{1},{t}_{2},{s}_{1}\right)=luv(<\left(a,2\right)(b,2)>,<\left(a,2\right)\left[(b,2)(f,3)\right]>,{s}_{1})=2\times 2+4\times 2=12$.
Definition 8. Given sequences X and Y such that $X\subseteq Y$, let y be the q-sequence with the maximum utility value among all the q-sequences matching Y in q-sequence s, and let x be the q-sequence that matches X in y. The local utility value of X in Y in the q-sequence s is denoted as $luv(X,Y,s)$ and is defined as: $luv(X,Y,s)=u\left(x\right)$. For example, suppose we need to calculate the local utility value of sequence $<a>$ in sequence $<ab>$ in q-sequence ${s}_{1}$, i.e., $luv(<a>,<ab>,{s}_{1})$. $<ab>$ has two matches in ${s}_{1}$, i.e., $<(a,2)(b,2)>$ and $<(a,2)(b,3)>$, where $u(<(a,2)(b,2)>)=2\times 2+2\times 4=12$ and $u(<(a,2)(b,3)>)=2\times 2+3\times 4=16$; therefore, $<(a,2)(b,3)>$ is the match with the maximum utility. The match of sequence $<a>$ in q-sequence $<(a,2)(b,3)>$ is $<(a,2)>$. Thus, $luv(<a>,<ab>,{s}_{1})=luv(<(a,2)>,<(a,2)(b,3)>,{s}_{1})=2\times 2=4$.
Definition 9. The local utility value of sequence X in Y such that $X\subseteq Y$ in a quantitative sequence database D is denoted as $luv(X,Y)$ and is defined as: $luv(X,Y)={\sum }_{s\in D}luv(X,Y,s)$. For example, the local utility value of sequence $<a>$ in sequence $<ab>$ in D is $luv(<a>,<ab>)=luv(<a>,<ab>,{s}_{1})+luv(<a>,<ab>,{s}_{2})+luv(<a>,<ab>,{s}_{5})=luv(<(a,2)>,<(a,2)(b,3)>,{s}_{1})+luv(<(a,3)>,<(a,3)(b,3)>,{s}_{2})+luv(<(a,3)>,<(a,3)(b,2)>,{s}_{5})=4+6+6=16$.
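Under the same assumptions about ${s}_{1}$ and the external utilities as before, Definitions 8 and 9 can be sketched as follows; the helper assumes the antecedent is a prefix of the full pattern, as is the case for the rule candidates of Section 4.3.

```python
# Sketch of Definitions 8-9 (illustrative; assumes the antecedent is a
# prefix of the full pattern, as in the rule candidates of Section 4.3).
# Utilities and the layout of s1 follow the worked examples in the text.

external_utility = {"a": 2, "b": 4, "f": 1, "d": 5}
s1 = [[("a", 2)], [("b", 2), ("f", 3)], [("b", 3)], [("d", 3)]]

def matches(pattern, qseq):
    """(TIDs, per-position utilities) for every match of pattern in qseq."""
    out = []
    def rec(k, start, tids, us):
        if k == len(pattern):
            out.append((tids, us))
            return
        for t in range(start, len(qseq)):
            items = dict(qseq[t])
            if pattern[k] <= items.keys():
                u = sum(external_utility[i] * items[i] for i in pattern[k])
                rec(k + 1, t + 1, tids + [t + 1], us + [u])
    rec(0, 0, [], [])
    return out

def luv_prefix(j, pattern, qseq):
    """Definition 8 for a length-j prefix: pick the match of `pattern` with
    maximum total utility, then sum the utilities of its first j positions.
    Definition 9 sums this quantity over all q-sequences of the database."""
    ms = matches(pattern, qseq)
    if not ms:
        return 0
    _, best_us = max(ms, key=lambda m: sum(m[1]))
    return sum(best_us[:j])

# luv(<a>, <ab>, s1): the best match of <ab> is <(a,2)(b,3)> (utility 16),
# and its <a> part contributes 2 * 2 = 4
print(luv_prefix(1, [{"a"}, {"b"}], s1))   # 4
```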
In particular, if Y is a negative sequence, it is extremely difficult to calculate the local utility value $luv(X,Y)$ due to the hidden nature of non-occurring events. In this paper, X and Y are converted to their corresponding maximum positive subsequences in order to simplify the calculation. The definition is given below.
Definition 10. The maximum positive subsequence of a negative sequence $s=<\neg {l}_{1},{l}_{2},\dots ,{l}_{n}>$ is denoted as $MPS\left(s\right)$ and is defined as the sequence obtained by removing all negative itemsets from s. For example, $MPS(<\neg ab>)=<b>$ and $MPS\left(<ab>\right)=<ab>$.
Definition 11. The local utility value of a sequence s (positive or negative) in another negative sequence $ns$ such that $s\subseteq ns$ in a database D is denoted as $luv(s,ns)$ and is defined as: $luv(s,ns)=luv\left(MPS\left(s\right),MPS\left(ns\right)\right)$, where $MPS\left(s\right)$ and $MPS\left(ns\right)$ represent the maximum positive subsequences of s and $ns$, respectively. For example, $luv(<b>,<b\neg def>)=luv\left(<b>,<bef>\right)$ and $luv(<b\neg de>,<b\neg def>)=luv\left(<be>,<bef>\right)$. To address the problem that the utility of the antecedent in a negative sequential rule may be less than the minimum utility threshold, we replace the antecedent with its maximum positive subsequence. The definition is shown below.
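A minimal sketch of Definitions 10 and 11, modelling a (possibly negative) sequence as a list of (is_negative, items) pairs; this encoding is an assumption made for illustration only.

```python
# Sketch of Definitions 10-11: a (possibly negative) sequence is modelled
# as a list of (is_negative, items) pairs; this encoding is an assumption
# made for illustration only.

def mps(seq):
    """Maximum positive subsequence: drop every negative itemset."""
    return [items for negative, items in seq if not negative]

# MPS(<-a b>) = <b>, MPS(<a b>) = <a b>
print(mps([(True, {"a"}), (False, {"b"})]))    # [{'b'}]
print(mps([(False, {"a"}), (False, {"b"})]))   # [{'a'}, {'b'}]

# Definition 11 reduces luv on negative sequences to the positive case:
# luv(<b -d e>, <b -d e f>) = luv(MPS(<b -d e>), MPS(<b -d e f>))
#                           = luv(<b e>, <b e f>)
s  = [(False, {"b"}), (True, {"d"}), (False, {"e"})]
ns = [(False, {"b"}), (True, {"d"}), (False, {"e"}), (False, {"f"})]
print(mps(s))    # [{'b'}, {'e'}]
print(mps(ns))   # [{'b'}, {'e'}, {'f'}]
```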
Definition 12. The utility of a HUNSR's antecedent $an$ is denoted as $u\left(an\right)$ and is defined by: $u\left(an\right)={u}_{max}\left(MPS\left(an\right)\right)$. Based on Definitions 11 and 12, the antecedent's local utility value and utility can be calculated using only the corresponding HUSP information of the HUNSRC. Hence, there is no need to determine a negative item's TID or to store the related information of HUNSP.
The utilityconfidence measure is an important criterion to evaluate the usefulness of rules. The detailed definition is shown below.
Definition 13. A high utility negative sequential rule is an implication of the form R: $X\Rightarrow Y$, where X, $Y\ne \varphi $, X is the antecedent of the rule, Y is the consequent of the rule, and $X\bowtie Y$ is a HUNSP.
The utility-confidence of the rule R is denoted by $uconf\left(R\right)$ and is defined by: $uconf\left(R\right)=\frac{luv\left(MPS\left(X\right),MPS(X\bowtie Y)\right)}{{u}_{max}\left(MPS\left(X\right)\right)}$, where $MPS\left(X\right)$ and $MPS(X\bowtie Y)$ represent the maximum positive subsequences of X and $X\bowtie Y$, respectively. R: $X\Rightarrow Y$ is called a HUNSR if its utility-confidence $uconf\left(R\right)$ is greater than or equal to the specified minimum utility-confidence ($minuconf$) threshold.
For example, if the minimum utility ($minutil$) threshold is 37.8 and the $minuconf$ is 0.5, then $R:<a\neg \left(bf\right)>\Rightarrow <b\neg d>$ is a HUNSR, since $u\left(R\right)=48\ge 37.8$ and $uconf\left(R\right)=0.89\ge 0.5$.
4.3. eHUNSR Candidate Generation
To generate all HUNSRCs from the HUNSPs, we use a straightforward method. The basic idea is to divide a HUNSP into two parts, i.e., the antecedent and the consequent. The details are as follows.
For a k-size $(k>1)$ HUNSP $P=<{e}_{1}{e}_{2}{e}_{3}\dots {e}_{k}>$, the set of its corresponding HUNSRCs is generated by dividing it into two parts, i.e., the antecedent $<{e}_{1}{e}_{2}\dots {e}_{i-1}>(i\in \left\{2,\dots ,k\right\})$ and the consequent $<{e}_{i}\dots {e}_{k}>$. In this way we obtain $(k-1)$ HUNSRCs.
For example, given a HUNSP $<a\neg \left(bc\right)d\neg ef\neg g>$, its corresponding five HUNSRCs are listed as follows: $\phantom{\rule{0.277778em}{0ex}}<a>\Rightarrow <\neg \left(bc\right)d\neg ef\neg g>,<a\neg \left(bc\right)>\Rightarrow <d\neg ef\neg g>,<a\neg \left(bc\right)d>\Rightarrow <\neg ef\neg g>,<a\neg \left(bc\right)d\neg e>\Rightarrow <f\neg g>,<a\neg \left(bc\right)d\neg ef>\Rightarrow <\neg g>$.
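The splitting procedure can be sketched as follows; elements are kept as opaque strings, with a leading "-" marking a negative element (an illustrative encoding, not the authors' representation).

```python
# Sketch of the candidate generation of Section 4.3: split a k-size HUNSP
# at every position i in {2..k}. Elements are opaque strings here, with a
# leading "-" marking a negative element (an illustrative encoding).

def generate_candidates(hunsp):
    """Return the (k-1) HUNSRCs <e1..e(i-1)> => <ei..ek> of a HUNSP."""
    k = len(hunsp)
    return [(hunsp[:i - 1], hunsp[i - 1:]) for i in range(2, k + 1)]

p = ["a", "-(bc)", "d", "-e", "f", "-g"]
for antecedent, consequent in generate_candidates(p):
    print(antecedent, "=>", consequent)
# 5 candidates, from ['a'] => ['-(bc)', 'd', '-e', 'f', '-g']
# down to ['a', '-(bc)', 'd', '-e', 'f'] => ['-g']
```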
4.4. Pruning Strategy
We obtain all HUNSRCs using the eHUNSR candidate generation method. However, not all of them are promising. To avoid calculating the utility-confidence of unpromising rules, we propose a pruning strategy to remove them. We define a HUNSRC as unpromising if its antecedent or consequent contains only one negative element. As the utility of a negative element is 0, a HUNSRC is meaningless if the utility of its antecedent or consequent is 0. A HUNSRC is removed from the list if it satisfies one of the following two conditions:
 (1)
There is only one element in the antecedent and it is negative;
 (2)
There is only one element in the consequent and it is negative.
For instance, $<\neg a>\Rightarrow <bc>$, $<ab>\Rightarrow <\neg c>$, and $<\neg a>\Rightarrow <\neg c>$ are unpromising rules, while $<\neg ab>\Rightarrow <c>$, $<a>\Rightarrow <b\neg c>$, and $<\neg ab>\Rightarrow <\neg cd>$ are promising.
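Reusing the string encoding in which a leading "-" marks a negative element, the two pruning conditions can be sketched as:

```python
# Sketch of the pruning conditions of Section 4.4, reusing the string
# encoding in which a leading "-" marks a negative element: a candidate is
# unpromising if either side is a single negative element (utility 0).

def is_unpromising(antecedent, consequent):
    def single_negative(side):
        return len(side) == 1 and side[0].startswith("-")
    return single_negative(antecedent) or single_negative(consequent)

print(is_unpromising(["-a"], ["b", "c"]))     # True:  <-a>   => <bc>
print(is_unpromising(["a", "b"], ["-c"]))     # True:  <ab>   => <-c>
print(is_unpromising(["-a", "b"], ["c"]))     # False: <-a b> => <c>
print(is_unpromising(["a"], ["b", "-c"]))     # False: <a>    => <b -c>
```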
4.5. Data Structure
In order to efficiently calculate the local utility value and utility of a HUNSRC's antecedent, we design a novel data structure called the sequence location and utility list (SLU-list). It records the SID, TID, and utility information of the intermediate subsequences generated during the mining of the HUSP that corresponds to the HUNSRC. The SLU-list is composed of a Seq-table and an LU-table. The Seq-table stores the intermediate subsequences of the HUSP generation process, and the LU-table stores the corresponding SID, TID, and utility. That is, each intermediate subsequence corresponds to multiple tuples of the form (SID, TID, Utility).
Table 3 gives an example, showing the SID, TID, and utility information of the generation from $<a>$ to $<a\left(bf\right)>$. Take $<ab>$ for example: $<ab>$ has six matching q-sequences in the database in Table 3, and the six matches come from ${s}_{1}$, ${s}_{2}$, and ${s}_{5}$. In ${s}_{1}$, it has two matches; in the first match, item a comes from itemset 1 (TID = 1), item b comes from itemset 2 (TID = 2), and the utility is 12; in the second match, item a comes from itemset 1 (TID = 1), item b comes from itemset 3 (TID = 3), and the utility is 16. Similarly, $<ab>$ has one and three matches in ${s}_{2}$ and ${s}_{5}$, respectively.
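A guessed in-memory shape for the SLU-list is sketched below (not the authors' exact structure): the ${s}_{1}$ entries for $<ab>$ follow the example above, while the ${s}_{2}$/${s}_{5}$ TIDs and the $<a>$ entries are hypothetical values chosen to be consistent with the utilities quoted in Sections 3 and 4.6.

```python
# A guessed in-memory shape for the SLU-list (not the authors' exact
# structure): the Seq-table maps each intermediate subsequence to its
# LU-table entries (SID, TIDs, utility). The s1 entries for <ab> follow
# the example in the text; the s2/s5 TIDs and the <a> entries are
# hypothetical values consistent with the utilities quoted in the text.

slu_list = {
    "<a>":  [(1, (1,), 4), (2, (1,), 6), (5, (1,), 6), (5, (3,), 8)],
    "<ab>": [(1, (1, 2), 12), (1, (1, 3), 16),   # two matches in s1
             (2, (1, 2), 18),                    # one match in s2
             (5, (1, 2), 14), (5, (1, 4), 10),   # three matches in s5
             (5, (3, 4), 12)],
}

def entries_for(subseq, sid):
    """All (TIDs, utility) pairs of `subseq` within q-sequence `sid`."""
    return [(tids, u) for s, tids, u in slu_list[subseq] if s == sid]

print(entries_for("<ab>", 1))   # [((1, 2), 12), ((1, 3), 16)]
```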
4.6. The Utility-Confidence of HUNSRC Calculation
We can efficiently calculate a HUNSRC's utility-confidence based on the above definitions and the SLU-list structure. First, we obtain the local utility value of the antecedent in the HUNSRC. Then, we obtain the utility of the antecedent. Finally, the local utility value divided by the utility is the utility-confidence.
For example, to calculate the utility-confidence of HUNSRC $<a>\Rightarrow <b\neg d>$, the first step is to calculate $luv(<a>,<ab\neg d>)$, the second step is to calculate ${u}_{max}\left(MPS\left(a\right)\right)$, and then $uconf(a\Rightarrow b\neg d)$ can be determined. The steps are as follows.
 Step 1:
Calculate the local utility value of the antecedent.
According to Equation (11), calculating $luv(<a>,<ab\neg d>)$ is equivalent to calculating $luv(<a>,<ab>)$. Firstly, sequence $<ab>$ has two matches in ${s}_{1}$, with utilities 12 and 16, respectively, as shown in Table 3. The maximum utility is 16, and its corresponding TIDs are {1, 3} in Table 3. Then, the TID of $<a>$ within this match is {1} and the corresponding utility is 4. Finally, we obtain the utility of $<a>$ in ${s}_{2}$ and ${s}_{5}$ by the same method, i.e., 6 and 6, respectively. Therefore, the local utility value of the antecedent is $luv(<a>,<ab\neg d>)=4+6+6=16$.
 Step 2:
Calculate the utility of the antecedent.
According to Equation (12), ${u}_{max}\left(MPS\left(a\right)\right)=4+6+8=18$ in Table 3.
 Step 3:
Calculate the utility-confidence: $uconf\left(a\Rightarrow b\neg d\right)=\frac{luv\left(a,ab\right)}{{u}_{max}\left(MPS\left(a\right)\right)}=\frac{16}{18}=0.89$.
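The three steps above can be replayed as plain arithmetic on the match information quoted in the text; the per-match $<a>$-part utilities for ${s}_{5}$ are assumptions consistent with the values given above.

```python
# Steps 1-3 replayed as plain arithmetic on the match information quoted
# above for <ab> (utilities {12,16} in s1, {18} in s2, {14,10,12} in s5).
# Each match is paired with the utility of its <a> part; the s5 <a>-part
# values (6 each) are assumptions consistent with the text.

matches_ab = {                 # SID -> [(utility of <ab> match, its <a> part)]
    1: [(12, 4), (16, 4)],
    2: [(18, 6)],
    5: [(14, 6), (10, 6), (12, 6)],
}
u_a_max = {1: 4, 2: 6, 5: 8}   # maximum utility of <a> itself per q-sequence

# Step 1: luv(<a>, <ab -d>) = luv(<a>, <ab>): per q-sequence take the <a>
# part of the best <ab> match, then sum over the database
luv = sum(max(ms)[1] for ms in matches_ab.values())   # 4 + 6 + 6 = 16

# Step 2: u_max(MPS(<a>)) = u_max(<a>)
u_ante = sum(u_a_max.values())                        # 4 + 6 + 8 = 18

# Step 3: utility-confidence
uconf = luv / u_ante
print(round(uconf, 2))   # 0.89
```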
The HUNSRs extracted from the q-sequence database given in Table 1 and Table 2 with $minutil=37.8$ and $minuconf=0.5$ are shown in Table 4. There are 94 HUNSRs in the mining result; due to space limitations, we show only a few of them here.
4.7. The eHUNSR Algorithm
The pseudocode of the eHUNSR algorithm is shown in Algorithm 1. It takes a quantitative sequential database D, $minutil$, $minuconf$ as inputs, and outputs all the HUNSR.
eHUNSR consists of four steps: (1) all HUNSPs are mined by the HUNSPM algorithm (Line 3); (2) all HUNSRCs are generated using the eHUNSR candidate generation method based on HUNSP (Line 4); (3) unpromising rules are removed from HUNSRC using the pruning strategy given in Section 4.4 (Lines 5–8); and (4) for each rule in HUNSRC, the antecedent's local utility value, the antecedent's utility, and the utility-confidence of the HUNSRC are calculated using Equations (11)–(13) (Lines 9–11). If the HUNSRC's utility-confidence satisfies $minuconf$, the rule is added to HUNSR (Lines 12–14).
Algorithm 1 eHUNSR Algorithm
1: Input: A quantitative sequential database D, $minutil$, $minuconf$.
2: Output: All HUNSRs.
3: mine all HUNSPs by the HUNSPM algorithm;
4: HUNSRC = eHUNSR candidate generation (HUNSP);
5: for each rule $R:X\Rightarrow Y$ in HUNSRC do
6:  if ($size\left(X\right)=1$ and X.utility = 0) or ($size\left(Y\right)=1$ and Y.utility = 0) then
7:   Remove it from HUNSRC;
8:  else
9:   Calculate $luv(X,X\bowtie Y)$ based on the SLU-list by Equation (11)
10:   Calculate ${u}_{max}\left(MPS\left(X\right)\right)$ based on the SLU-list by Equation (12)
11:   Calculate $uconf(R:X\Rightarrow Y)$ by Equation (13)
12:   if $uconf(R:X\Rightarrow Y)\ge minuconf$ then
13:    HUNSR.add(R)
14:   end if
15:  end if
16: end for
17: Return HUNSR.
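A runnable skeleton of the control flow of Algorithm 1 is given below (a Python sketch only); the HUNSPM mining step and the SLU-list-based Equations (11) and (12) are stubbed with toy values taken from Section 4.6, so only the flow of Lines 4–14 is reproduced.

```python
# Runnable skeleton of Algorithm 1 (a Python sketch of the control flow
# only): the HUNSPM mining step and the SLU-list-based Equations (11)-(12)
# are stubbed with toy values taken from Section 4.6.

def ehunsr(hunsps, luv_fn, umax_fn, minuconf):
    """Lines 4-16: generate candidates, prune, and filter by uconf."""
    hunsr = []
    # Line 4: split each HUNSP at every position into antecedent/consequent
    cands = [(p[:i], p[i:]) for p in hunsps for i in range(1, len(p))]
    for ante, cons in cands:                          # Line 5
        # Lines 6-7: prune single-negative antecedents or consequents
        if (len(ante) == 1 and ante[0].startswith("-")) or \
           (len(cons) == 1 and cons[0].startswith("-")):
            continue
        # Lines 9-11: utility-confidence via Equations (11)-(13)
        uconf = luv_fn(ante, ante + cons) / umax_fn(ante)
        if uconf >= minuconf:                         # Lines 12-13
            hunsr.append((ante, cons, uconf))
    return hunsr

# Toy run: one HUNSP <a b -d>; the stubs return the values of Section 4.6.
rules = ehunsr([["a", "b", "-d"]],
               luv_fn=lambda x, y: 16,
               umax_fn=lambda x: 18,
               minuconf=0.5)
for ante, cons, uconf in rules:
    print(ante, "=>", cons, round(uconf, 2))
# <ab> => <-d> is pruned; <a> => <b -d> survives with uconf 0.89
```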
4.8. Theoretical Analysis of the Utility-Confidence Framework
Since the utility-confidence framework is the core of our proposed eHUNSR algorithm, in this section we give a theoretical analysis of its rationality.
Traditional association rule mining algorithms depend on the support-confidence framework, in which all items are given the same importance [16,21,22,23]. The goal of such algorithms is to extract all the valid association rules whose confidence is at least a user-defined minimum confidence. However, they do not consider utility, which means some important patterns with low frequencies will be lost [14]. For example, consider a sequential rule $R:f\Rightarrow e$ in Table 2. The support of R, i.e., support($fe$), is one because only one q-sequence in Table 2 contains $fe$, i.e., ${s}_{3}$. Similarly, support(f) = 5. The confidence of R is 0.2, calculated by confidence(R) = support($fe$)/support(f). If the minimum confidence is 0.3, $R:f\Rightarrow e$ will be considered an invalid rule. But if we consider utility, the utility-confidence of R is $uconf\left(R\right)=luv(f,fe)/u\left(f\right)=6/15=0.4$, and $R:f\Rightarrow e$ will be considered a valid rule when $minuconf$ is 0.3.
Why is the same rule treated differently under different frameworks? In fact, the support-confidence framework does not provide any additional knowledge beyond measures that reflect the statistical correlation among items [19,24]. $R:f\Rightarrow e$ is considered an invalid rule just because f contributes only once to $fe$ even though f occurs five times in Table 2. In addition, the framework does not reflect the semantic implication of the mined knowledge; that is, it does not take utility into account [25,26]. However, $uconf\left(R\right)=0.4$ indicates that the utility contribution of f to $fe$ accounts for 40% of the total utility of f under the utility-confidence framework. Hence, $R:f\Rightarrow e$ is considered a valid rule. The support-confidence model may not measure the usefulness of a rule in accordance with a user's objective (for example, profit), and the utility-confidence framework is more reasonable for decision-making.
5. Experiments
We conduct experiments on two real-life and 12 synthetic datasets to evaluate the efficiency of eHUNSR. Since eHUNSR is the first algorithm for high utility negative sequential rule mining, there are no baseline algorithms to compare against; we therefore test the performance of eHUNSR in terms of execution time and the number of HUNSRs under different factors. In the experiments, all HUNSPs are mined by the HUNSPM algorithm and all HUNSRs are identified by eHUNSR. The algorithm is written in Java and implemented in Eclipse, running on a Windows 10 PC with 16 GB of memory and an Intel(R) Core(TM) i7-6700 CPU at 3.40 GHz.
5.1. Datasets
We use the following data factors, C, T, S, I, DB, and N, defined in [27], to describe the impact of data characteristics.
C: Average number of elements per sequence; T: Average number of items per element; S: Average length of maximal potentially large sequences; I: Average size of items per element in maximal potentially large sequences; DB: Number of sequences; and N: Number of items.
DS1 (C8_T2_S6_I2_DB10k_N0.6k) and DS2 (C10_T2_S6_I2_DB10k_N1k) are synthetic datasets generated by the IBM data generator [27]. DS3 and DS4 are real-life datasets. DS3 is the BMS-WebView2 dataset from KDD-CUP 2000 [28,29]. It includes clickstream data from Gazelle.com. The dataset contains 7631 shopping sequences and 3340 products; the average number of elements per sequence is 10, the maximum length of a customer sequence is 379, and the most popular product is ordered 3766 times. DS4 is a sign language utterance dataset containing 800 sequences [30]. It is a dense dataset with very long sequences, 267 distinct items, and an average sequence length of 93.
Table 5 shows the four datasets and their characteristics. Since only the synthetic datasets have data factors S and I (the real-life datasets do not), we do not show them in Table 5. DS1 is moderately dense and contains short sequences. DS2 is moderately dense and contains medium-length sequences. DS3 is a sparse dataset that contains many medium-length sequences and a few very long sequences. DS4 is a dense dataset with very long sequences.
For all datasets, the external utilities of items are generated between 0 and 50 using a log-normal distribution, and the quantities are generated randomly between 1 and 10, similar to the settings of [3,28].
5.2. Evaluation of $minutil$ Impact
We analyze the impact of $minutil$ on the algorithm performance in terms of the running time and the number of HUNSRs; thus, $minuconf$ is fixed and $minutil$ is varied. Since the four datasets have different characteristics, we set different $minutil$ and $minuconf$ values for each to better reflect the impact of $minutil$.
Figure 2a shows that as $minutil$ increases, the execution time and the number of HUNSRs decrease gradually. This is because the number of HUNSPs decreases as $minutil$ increases. Figure 2b–d show a similar trend for both the synthetic and real-life datasets DS2 to DS4. The results in Figure 2 also show that eHUNSR can extract HUNSRs under a very low $minutil$ (e.g., 0.00092 for DS2). It is worth noting that DS4 is a very dense dataset, and the result shown in Figure 2d indicates that eHUNSR also adapts well to dense datasets.
5.3. Evaluation of minuconf Impact
In this experiment, we assess the impact of $minconf$ on the algorithm performance in terms of the running time and number of HUNSRs. The $minutil$ is fixed and the $minuconf$ is varied.
Figure 3a shows that with the increase of
$minuconf$, the number of HUNSRs decreases gradually, while the execution time does not change much. A similar trend can be found in the results of
Figure 3b–d from DS2 to DS4.
5.4. Data Characteristics Analysis
In this section, we explore the impact of data characteristics on the performance of eHUNSR as well as the sensitivity of eHUNSR to particular data factors.
DS1 is extended to 10 new datasets by tuning each factor and we mark the different factors in bold for each dataset as shown in
Table 6. For example, DS1.1 (C6_T2_S6_I2_DB10k_N0.6k) and DS1.2 (C10_T2_S6_I2_DB10k_N0.6k) have a different data factor C compared with DS1 (C8_T2_S6_I2_DB10k_N0.6k) and we test the influence of data factor C on algorithm performance through the three datasets. We analyze the performance of eHUNSR in terms of the running time and the number of HUNSRs from the perspective of
$minutil$ and
$minuconf$ based on different datasets respectively.
According to the results shown in
Figure 4,
Figure 5,
Figure 6 and
Figure 7, factors C and T significantly affect the performance of the eHUNSR algorithm, while factors S and I have little effect on it. In general, with the increase of factors C, T, S, and I, the running time and the number of HUNSRs increase accordingly. Taking the results in
Figure 5 for example, when factor T is higher, such as DS1.4, the eHUNSR generates more rules and takes more time than DS1. However, factor N is the opposite as the running time and the number of HUNSRs first increase and then decrease with the increase of N according to the results shown in
Figure 8. Moreover, from
Figure 4,
Figure 5,
Figure 6,
Figure 7 and
Figure 8, the descending gradient of factors C and N is larger than that of T, S, and I. This indicates that eHUNSR is more sensitive to factors C and N than T, S, and I.
5.5. Scalability Test
Since the algorithm’s performance is affected by the dataset size, we conducted a scalability test to evaluate eHUNSR’s performance on large
qsequence datasets.
Figure 9 shows the results on datasets DS2 and DS3 based on different sizes: From 5 (i.e., 2.05 M) to 20 (41.1 M) times of DS2, and from 5 (3.78 M) to 20 (75.7 M) times of DS3.
The results in
Figure 9 show that the growth of the runtime of eHUNSR on large
qsequence datasets follows a roughly linear relationship with the datasets size increasing with different
$minutil$ values. The results also show that eHUNSR works particularly well on huge
qsequence datasets.
5.6. A RealLife Application of the eHUNSR Algorithm
In this section, we apply the eHUNSR algorithm to prefabricated chunks extraction and prediction. Prefabricated chunks are composed of more than one word that occur frequently and are stored and used as a whole, such as “put off”, “by the way”, “it is important that”, etc. Prefabricated chunks has played an important role in language learning [
31], including helping teachers establish some new teaching models [
32], improving fluency and accuracy of an interpreter [
33], increasing students’ listening predictive ability [
34], etc.
The dataset used for this application is Leviathan, which is a conversion of the novel Leviathan by Thomas Hobbes (1651) as a sequence database (each word is an item) [
30]. The dataset has 5834 sequences, 9025 items, and the average sequence length is 33.8. We randomly selected 100 sequences to conduct this experiment. In total, 171 rules were obtained with
$minutil$ = 0.078 and
$minuconf$ = 0.4. Since converting items to words is timeconsuming, we only show five rules to analyze the meaning of the rules. The five rules are shown in
Table 7.
Rule 1 indicates that if “controversies, the, cause, of, war ” occurs in order first (“ ¬and ” indicates that “and” does not occur), then we can predict with 90% certainty that “against, the, law” will occur next in order in the novel Leviathan. Rule 2 indicates that if “ sickness ” occurs in order first, we can predict with 70% certainty that “civil, war, death ” will occur next. Rule 3 indicates that if “cause, of, sense” occurs in order first, we can predict with 43% certainty that “ is, the, external, body” will occur next in order. Rule 4 indicates that if “several, motions, diversely” occurs in order first, we can predict with 82% certainty that “to, counterfeit, just, trust” will occur next. Rule 5 indicates that if “things, suggested” occurs in order first, we can predict with 80% certainty that “the, memory, and, equity, and, laws” will occur next in order.
In fact, there are sentences like “So, the controversies, that is, the cause of war, remains against the Law of nature", “Concord, health; sedition, sickness; and civil war, death" and so on in the original novel. This proves that our extracted rules are useful. The antecedents and consequents like “sickness" and “civil, war, death" are extracted prefabricated chunks. The value of uconf represents the possibility of the occurrence of the predictive prefabricated chunks. These prefabricated chunks can help people understand the novel better and quickly to a certain extent.
6. Conclusions and Future Work
HUSR mining can precisely tell the probability that some subsequences will happen after other subsequences happen, but it does not take nonoccurring events into account, which can result in the loss of valuable information. So, this paper has proposed a comprehensive algorithm called eHUNSR to mine high utility negative sequential rules. First, in order to overcome the difficulty of defining the HUNSR mining problem, we have defined a series of important concepts, including the local utility value, utilityconfidence measure, and so on. Second, in order to overcome the difficulty of calculating the antecedent’s local utility value in a HUNSR, we proposed a novel data structure called a SLUlist to record all required information, including information of SID, TID, and utility of the intermediate subsequences during the generation of HUSP. Third, in order to efficiently calculate the local utility value and utility of the antecedent, we converted the HUNSR’s calculation problem to its corresponding HUSR’s calculation problem to simplify the calculation. In addition, we proposed an efficient method to generate HUNSRC based on the HUNSP mined by the HUNSPM algorithm and a pruning strategy to prune a large proportion of meaningless HUNSRCs. To the best of our knowledge, eHUNSR is the first study to mine HUNSR. The experimental results on two reallife and 12 synthetic datasets show that eHUNSR is very efficient.
From the experiments, we can see that the number of HUNSRs mined from HUNSP is very large and our recent research shows that not all of the HUNSRs can be used to make decisions. Our future work is to find strategies to select those actionable HUNSRs. In addition, many research studies are based on a quantitative database or fuzzy data and are very useful [
35]. But with the development of economy and the progress of Internet technology, the amount of information generated in social media channels (e.g., Twitter, Linkedin, Instagram, etc.) or economical/business transactions exceeds the usual bounds of static databases and is in continuous movement [
36]. Therefore, it is important to design a streaming rule extraction algorithm in a dynamitic databases.