3.1. Word2vec Moving Distance Model
There are two main methods used for text vectorization: word2vec and doc2vec. Word2vec performs semantic analysis only at the level of individual words and does not capture the context-level semantics that doc2vec targets. The text data in this paper consist mainly of fault names rather than fault descriptions with surrounding context, so a word-level representation is sufficient.
Reference [6] adopted word2vec, which was released by Google in 2013 as an efficient tool for representing words as real-valued vectors. Kai et al. [7] argued that domain knowledge is reflected by the semantic meanings behind keywords rather than by the keywords themselves, and applied the word2vec model to represent the semantic meaning of keywords. Based on that work, they proposed a new domain-knowledge approach, the semantic frequency semantic active index, similar to term frequency-inverse document frequency, to link domain and background information and to identify infrequent but important keywords. Park et al. [8] suggested an efficient classification method for Korean sentiment using word2vec and recently studied ensemble methods. For 200,000 Korean movie review texts, they generated a POS (Part Of Speech)-based BOW (Bag Of Words) feature and a word2vec feature, and integrated the features of the two representations.
Yongjun et al. [9] examined the ability of word2vec to derive semantic relatedness and similarity between biomedical terms from large publication data. They downloaded abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets were preprocessed and grouped into subsets by recency, size, and section, and word2vec models were trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models were compared against reference standards, and the performance of models trained on different subsets was compared to examine recency, size, and section effects. To extract key topics from new articles, Zhao et al. [10] investigated how to construct text vectors efficiently and improve the efficiency and accuracy of document clustering based on the word2vec model. Through training, the processing of text content is reduced to K-dimensional vector operations, and similarity in the vector space can be used to represent the semantic similarity of texts. The word2vec model includes the CBOW (Continuous Bag-of-Words) model and the Skip-gram model, and either can be trained with the Hierarchical Softmax or the Negative Sampling algorithm.
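As an illustration of how these models are typically trained in practice, the following is a minimal sketch using the gensim library (assuming gensim >= 4.0); the tokenized fault-name corpus shown here is hypothetical.

```python
# A minimal sketch of training word2vec on tokenized fault names with gensim.
from gensim.models import Word2Vec

corpus = [
    ["hydraulic", "pump", "leakage"],      # each fault name tokenized into words
    ["hydraulic", "valve", "blockage"],
    ["motor", "bearing", "overheating"],
]

# sg=0 selects the CBOW model (sg=1 would select Skip-gram);
# hs=1 enables Hierarchical Softmax, negative=0 disables Negative Sampling.
model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimension d of the word vectors
    window=2,         # m context words on each side of the target word
    min_count=1,
    sg=0,
    hs=1,
    negative=0,
)

print(model.wv["pump"][:5])   # first few components of one word vector
```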
The schematic diagram of the CBOW model based on Hierarchical Softmax is shown in Figure 1. It is composed of three layers: an input layer, a projection layer, and an output layer. Here, we take the sample $(\mathrm{context}(w), w)$, with m words of context before and after $w$, as an example.
Input layer: the one-hot representations of the $2m$ context words of $w$.
Projection layer: the accumulated sum of the $2m$ vectors from the input layer, $x_w = \sum_{i=1}^{2m} v(\mathrm{context}(w)_i)$, with a total of N nodes.
Output layer: a Huffman tree built with the frequency of each word in the corpus as its weight. Its leaf nodes are the words appearing in the corpus; there are V leaf nodes in total, corresponding to the words of dictionary D, and V − 1 non-leaf nodes.
Between the input layer and the projection layer there is a word-vector matrix; after training, this matrix is essentially the word-vector representation that word2vec outputs.
The word-vector matrix $X \in \mathbb{R}^{d \times N}$ is thus obtained by word2vec, where N is the size of the dictionary (N words) and d is the dimension of the word vectors. The i-th column of the matrix, the column vector $x_i$, is the word vector of the i-th word of the dictionary in the d-dimensional space.
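Continuing the sketch above, the trained word-vector matrix and its dictionary can be read directly from the model; note that gensim stores one word vector per row, i.e., the transpose of the d × N matrix described in the text.

```python
# Reading the word-vector matrix and the dictionary from the trained model.
import numpy as np

X = model.wv.vectors                # shape (N, d): N dictionary words, d dimensions
words = model.wv.index_to_key       # the dictionary, in matrix-row order

i = words.index("pump")
x_i = X[i]                          # word vector of the i-th word in the dictionary
assert np.allclose(x_i, model.wv["pump"])
```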
The idea of the word vector moving distance model is that the word vectors of one text can be partially or completely moved onto the word vectors of the other text; that is, each word in one text is matched to all words in the other text with different weights.
Normalized bag-of-words representation: $d_i = c_i / \sum_{j=1}^{m} c_j$, where $c_i$ is the frequency of the i-th word in a text containing m different words.
Word vector moving cost: the goal is to incorporate the degree of semantic similarity between word pairs into the text distance. The moving cost between word i and word j is taken as the Euclidean distance between their word vectors, $c(i, j) = \lVert x_i - x_j \rVert_2$.
Word vector moving distance: $T$ is a flow matrix whose entry $T_{ij} \ge 0$ denotes how much of the i-th word of text $d$ flows to the j-th word of text $d'$. To transform text $d$ fully into text $d'$, the total outflow of the i-th word must equal $d_i$, i.e., $\sum_j T_{ij} = d_i$, and the total inflow into the j-th word must equal $d'_j$, i.e., $\sum_i T_{ij} = d'_j$. The distance between text $d$ and text $d'$ is then defined as the minimum cumulative cost of moving the words of $d$ onto $d'$.
A word vector moving distance model is then obtained as the following linear program:
$$\min_{T \ge 0} \sum_{i,j=1}^{p} T_{ij}\, c(i,j) \quad \text{s.t.} \quad \sum_{j=1}^{p} T_{ij} = d_i \;\; \forall i, \qquad \sum_{i=1}^{p} T_{ij} = d'_j \;\; \forall j.$$
The complexity of solving this problem is $O(p^3 \log p)$, where p represents the number of different words.
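The linear program above can be solved with any standard LP solver. The following toy sketch uses scipy.optimize.linprog; the two texts, their normalized weights, and the word vectors are hypothetical stand-ins for the quantities defined above.

```python
# A toy sketch of the word-moving-distance linear program solved with scipy.
import numpy as np
from scipy.optimize import linprog

# normalized bag-of-words weights of text A (2 words) and text B (3 words)
dA = np.array([0.5, 0.5])
dB = np.array([1 / 3, 1 / 3, 1 / 3])

# hypothetical word vectors for the words of A (rows) and of B (rows)
XA = np.random.rand(2, 50)
XB = np.random.rand(3, 50)

# moving cost c(i, j) = Euclidean distance between word vectors
cost = np.linalg.norm(XA[:, None, :] - XB[None, :, :], axis=2)

m, n = cost.shape
A_eq = []
for i in range(m):                    # outflow of the i-th word of A equals dA[i]
    row = np.zeros(m * n)
    row[i * n:(i + 1) * n] = 1.0
    A_eq.append(row)
for j in range(n):                    # inflow into the j-th word of B equals dB[j]
    col = np.zeros(m * n)
    col[j::n] = 1.0
    A_eq.append(col)
b_eq = np.concatenate([dA, dB])

res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
              bounds=(0, None), method="highs")
print("word vector moving distance:", res.fun)
```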
Based on the word vector moving distance between two texts, their text similarity can be calculated by min-max normalizing the moving distance over the data set:
$$\mathrm{sim}(d, d') = 1 - \frac{D(d, d') - D_{\min}}{D_{\max} - D_{\min}},$$
where $D_{\min}$ and $D_{\max}$, respectively, represent the minimum and maximum word vector moving distances in the data set.
Because of the properties of the similarity measure, the similarity matrix is a symmetric matrix with 1 on the diagonal, and its off-diagonal elements lie in (0,1). The greater the similarity of two texts, the smaller their distance; conversely, the smaller the similarity, the greater the distance. Therefore, the final distance between two texts is taken as the reciprocal of their similarity.
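A minimal sketch of this normalization step, assuming a matrix D of pairwise word vector moving distances has already been computed for the data set (the values below are made up):

```python
# Turning pairwise moving distances into the normalized similarity matrix.
import numpy as np

D = np.array([[0.0, 1.8, 2.6],
              [1.8, 0.0, 2.1],
              [2.6, 2.1, 0.0]])           # hypothetical pairwise moving distances

d_min, d_max = D.min(), D.max()
S = 1.0 - (D - d_min) / (d_max - d_min)   # min-max normalization, larger = more similar

# final distance used downstream: the reciprocal of similarity
eps = 1e-9                                # avoid division by zero for dissimilar pairs
final_dist = 1.0 / (S + eps)
print(S)
```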
3.2. Clustering Algorithm for Failure Type
The k-means method [11] is a classical method for solving the clustering problem. The algorithm is rather subjective in that the number of clusters must be specified in advance. Many other clustering algorithms have been developed, such as grid-based [12], hierarchy-based [13], model-based [14], and density-based [15] algorithms. The processing time of grid-based clustering depends on the number of cells into which each dimension of the space is divided, which limits the quality and accuracy of clustering. The computational complexity of hierarchy-based algorithms is too high. Model-based clustering rests on the hypothesis that the variables are independent of each other, but this assumption often does not hold. For density-based clustering, the clustering effect deteriorates when the density distribution is not uniform. The clustering algorithm used in this paper is the algorithm proposed by Rodriguez and Laio [16] in “Science”, which is novel, simple, and fast. According to the characteristics of the data, it can automatically determine the number of cluster centers, and both its clustering effect and its computational efficiency are high. The algorithm rests on two basic assumptions: cluster centers are surrounded by neighbors with lower local density, and they lie at a relatively large distance from any point with a higher local density.
This clustering algorithm can be divided into four steps. Here is a brief introduction to these four steps:
1. Calculate the local density
The clustering data set is $S = \{x_1, x_2, \ldots, x_n\}$, and $d_{ij}$ denotes the distance between points $x_i$ and $x_j$. This paper adopts the Gaussian kernel function to calculate the local density:
$$\rho_i = \sum_{j \ne i} \exp\!\left(-\frac{d_{ij}^2}{d_c^2}\right),$$
where $\rho_i$ essentially measures the number of data points whose distance to $x_i$ is less than the cut-off distance $d_c$, regardless of $x_i$ itself. The parameter $d_c$ must be specified in advance and, to some extent, determines the effect of the clustering algorithm. If $d_c$ is too large, the local density value of every data point will be large, resulting in low discrimination; in the extreme case where $d_c$ exceeds the maximum distance between any two points, the algorithm ends with all points assigned to a single cluster center. If $d_c$ is too small, the same group may be split into multiple clusters; in the extreme case where $d_c$ is smaller than the distance between any two points, every point becomes its own cluster center. The heuristic given by the authors of [16] is to select $d_c$ so that the average number of neighbors per data point is about 1–2% of the total number of data points.
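A short sketch of step 1 under the Gaussian-kernel formulation above; choosing d_c as a low quantile of the pairwise distances (so each point has on average roughly 2% of the points as neighbors) and the 2-D sample data are illustrative assumptions.

```python
# Step 1: local density with a Gaussian kernel and a quantile-based d_c.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def local_density(X, neighbor_ratio=0.02):
    dist = squareform(pdist(X))                    # pairwise distance matrix d_ij
    triu = dist[np.triu_indices_from(dist, k=1)]   # all pairwise distances once
    dc = np.quantile(triu, neighbor_ratio)         # cut-off distance d_c
    rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0   # subtract the point itself
    return rho, dist, dc

X = np.random.rand(200, 2)                         # hypothetical 2-D data points
rho, dist, dc = local_density(X)
```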
2. Calculate the distance δ
A subscript sequence sorted by decreasing local density is generated, $\{q_1, q_2, \ldots, q_n\}$ with $\rho_{q_1} \ge \rho_{q_2} \ge \cdots \ge \rho_{q_n}$. The distance is then defined as
$$\delta_{q_i} = \begin{cases} \max\limits_{j} d_{q_1 q_j}, & i = 1, \\ \min\limits_{j < i} d_{q_i q_j}, & i \ge 2. \end{cases}$$
In other words, when i = 1, $\delta$ is the distance between $x_{q_1}$ and the data point farthest from it in S. If $i \ge 2$, $\delta_{q_i}$ is the distance between $x_{q_i}$ and the closest data point (or points) among all data points with a local density greater than that of $x_{q_i}$.
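Continuing the previous sketch, step 2 can be computed directly from the density values and the distance matrix:

```python
# Step 2: distance to the nearest point of higher local density.
import numpy as np

def delta_distance(rho, dist):
    n = len(rho)
    order = np.argsort(-rho)                 # subscripts sorted by decreasing density
    delta = np.zeros(n)
    delta[order[0]] = dist[order[0]].max()   # i = 1: farthest point in S
    for k in range(1, n):
        i = order[k]
        higher = order[:k]                   # all points with larger local density
        delta[i] = dist[i, higher].min()     # nearest of those points
    return delta

delta = delta_distance(rho, dist)
```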
3. Select the clustering center
So far, the pair $(\rho_i, \delta_i)$ of every data point has been obtained. To take both quantities into account, we use the following formula to select the clustering centers:
$$\gamma_i = \rho_i\, \delta_i.$$
For example, the following figure (Figure 2) contains 20 data points for which the $(\rho_i, \delta_i)$ values have already been obtained.
As shown in Figure 2, panel (A) is the clustering effect diagram of the data points, which are divided into two clusters: the center of the first cluster is data point 1, while the center of the second cluster is data point 10. These two cluster centers are selected according to panel (B), where $\rho$ is the number of data points whose distance to a given point is less than $d_c$ and $\delta$ is that point's distance to the nearest point of higher density. From panel (B), we can see that data points 1 and 10 lie far away from the other points in the coordinate system. According to the core idea of this clustering algorithm, cluster centers are points that have many data points around them and are far away from other cluster centers. Therefore, data points 1 and 10 are the cluster centers in this case.
Next, we calculate $\gamma_i = \rho_i \delta_i$ to select the cluster centers. The following figure (Figure 3) shows the $\gamma$ curve. According to this figure, the curve is smooth for the non-cluster centers, and there is a clear jump between the cluster centers and the non-cluster centers.
4. Categorize other data points
Once the cluster centers are fixed, the distance between each cluster center and every remaining data point can be calculated, and each data point is assigned to the cluster whose center is closest to it.
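Steps 3 and 4 can then be sketched as follows, continuing the previous snippets; the number of cluster centers k is assumed to have been read off the jump in the γ curve, and the assignment follows the nearest-center rule described above.

```python
# Steps 3 and 4: select cluster centers by gamma and assign the remaining points.
import numpy as np

gamma = rho * delta                   # comprehensive score for center selection
k = 2                                 # hypothetical: read off the jump in the gamma curve
centers = np.argsort(-gamma)[:k]      # the k points with the largest gamma

# assign every data point to the cluster whose center is closest to it
labels = np.argmin(dist[:, centers], axis=1)
```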
3.3. Failure Sequence Mining Algorithm—PrefixSpan
Common sequential pattern mining algorithms include the Generalized Sequential Pattern (GSP) algorithm, Apriori, CloSpan, and PrefixSpan. GSP and Apriori are the traditional algorithms for sequence mining, and their performance is worse than that of PrefixSpan. CloSpan is suitable for mining long sequences, whereas PrefixSpan performs better on short sequences.
The text data sequences in this paper are short, so this paper adopts the PrefixSpan algorithm [17]. PrefixSpan is a sequential pattern mining algorithm that has been applied in many fields. For example, it has been applied to text mining for the Indonesian language, which continues to be an interesting research topic. Maylawati et al. [18] compared several sequential pattern algorithms, including BI-Directional Extension (BIDE), PrefixSpan, and TRuleGrowth. They found that the average processing time of PrefixSpan was faster than those of BIDE and TRuleGrowth; on the other hand, PrefixSpan and TRuleGrowth used memory more efficiently than BIDE.
To reduce the large space and time overhead of the PrefixSpan algorithm, a new sequential pattern mining algorithm based on PrefixSpan, termed the PrefixSpan-x algorithm, was proposed [19]; it reduces unnecessary storage space and removes infrequent items. PrefixSpan has also been applied to big data. To make PrefixSpan scalable, two problems must be addressed in a MapReduce implementation: the first concerns parsing and analyzing big data, and the second concerns managing projected databases. Two methods, Multiple MapReduce and Derivative Projected Database, were proposed to overcome the first and second problems, respectively, and Sambrina et al. [20] showed that these methods can significantly reduce execution time and support the scalability of PrefixSpan.
A sequence database S is a collection of different sequences, and s denotes one of its sequences. A sequence $\alpha = \langle a_1 a_2 \cdots a_n \rangle$ is a subsequence of $\beta = \langle b_1 b_2 \cdots b_m \rangle$ (equivalently, $\beta$ contains $\alpha$) if there exist integers $1 \le j_1 < j_2 < \cdots < j_n \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \ldots, a_n \subseteq b_{j_n}$. The support of a sequence $\alpha$ in the sequence database S is the number of sequences in S containing $\alpha$, denoted Support($\alpha$). Given a support threshold min_sup, if the support of $\alpha$ in the sequence database is not less than min_sup, $\alpha$ is called a sequential pattern; a sequential pattern of length l is called an l-pattern.
Definition 1. Prefix: Assume all items within each element of a sequence are listed in lexicographic order. Given sequences $\alpha = \langle e_1 e_2 \cdots e_n \rangle$ and $\beta = \langle e'_1 e'_2 \cdots e'_m \rangle$ (m < n), $\beta$ is a prefix of $\alpha$ if $e'_i = e_i$ for $i \le m-1$, $e'_m \subseteq e_m$, and all items of $e_m - e'_m$ come after the items of $e'_m$ in lexicographic order.
Definition 2. Projection: Given sequences $\alpha$ and $\beta$ such that $\beta$ is a subsequence of $\alpha$, a subsequence $\alpha'$ of $\alpha$ is the projection of $\alpha$ with respect to $\beta$ if it satisfies the following constraints: $\beta$ is a prefix of $\alpha'$, and $\alpha'$ is the largest subsequence of $\alpha$ that satisfies this condition.
Definition 3. Suffix: Let the projection of sequence $\alpha$ with respect to subsequence $\beta = \langle e_1 e_2 \cdots e_m \rangle$ be $\alpha' = \langle e_1 e_2 \cdots e_n \rangle$ (n > m). The suffix of $\alpha$ with respect to $\beta$ is $\langle e_{m+1} \cdots e_n \rangle$, obtained by removing the prefix from the projection.
Definition 4. Projection database and projection database support: Let $\alpha$ be a sequential pattern in the sequence database S. The $\alpha$-projected database, denoted $S|_\alpha$, is the collection of suffixes with respect to $\alpha$ of all sequences in S that have $\alpha$ as a prefix. The support of a sequence $\beta$ in the $\alpha$-projected database $S|_\alpha$ is the number of sequences $\gamma$ in $S|_\alpha$ that contain $\beta$.
The PrefixSpan algorithm is a frequent pattern mining method that does not generate candidates. Its basic idea is as follows: first find every frequent item, then build the collection of projected databases, one for each frequent item, and mine each projected database separately and recursively. The algorithm grows a prefix pattern and joins it with suffix patterns to obtain the frequent sequential patterns, thereby avoiding candidate generation.
The following example describes the mining process on the sequence database S shown in Table 3, with min_sup = 2.
- (1) Obtain the sequential patterns of length 1. Scan S once to find all length-1 sequential patterns: <a>: 4, <b>: 4, <c>: 4, <d>: 3, <e>: 3, <f>: 3, where “<pattern>: count” indicates the pattern and its support count.
- (2) Divide the search space. The complete set of sequential patterns can be divided into six subsets according to the six prefixes: those with prefix <a>, <b>, <c>, <d>, <e>, and <f>, respectively.
- (3) Find the subsets of sequential patterns. The subsets mentioned in step (2) are mined by constructing the corresponding projected databases and mining each one recursively.
The resulting sequential patterns are shown in Table 4.
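To make the mining process concrete, the following is a compact PrefixSpan sketch restricted to sequences of single items (itemset elements and the optimizations of [17] are omitted for brevity); the toy database is hypothetical and is not the one in Table 3.

```python
# A compact PrefixSpan sketch for sequences of single items.
from collections import defaultdict

def prefixspan(db, min_sup):
    """Return all sequential patterns with support >= min_sup."""
    patterns = []

    def mine(prefix, projected):
        # count how many sequences of the projected database contain each item
        counts = defaultdict(int)
        for seq in projected:
            for item in set(seq):
                counts[item] += 1
        for item, sup in sorted(counts.items()):
            if sup < min_sup:
                continue
            new_prefix = prefix + [item]
            patterns.append((new_prefix, sup))
            # projected database of the new prefix: suffixes after the first
            # occurrence of `item` in each sequence
            new_proj = []
            for seq in projected:
                if item in seq:
                    suffix = seq[seq.index(item) + 1:]
                    if suffix:
                        new_proj.append(suffix)
            mine(new_prefix, new_proj)

    mine([], db)
    return patterns

db = [list("abcdf"), list("acbde"), list("bdfca"), list("afcbd")]
for pat, sup in prefixspan(db, min_sup=2):
    print("<%s>: %d" % (" ".join(pat), sup))
```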
3.4. Bayesian Failure Network Model
A Bayesian network is a probabilistic graphical model that represents the dependency relationships among a set of random variables with a directed acyclic graph. Faults in air handling units (AHUs) significantly affect building energy efficiency and indoor environmental quality, and there is still a lack of effective methods for diagnosing AHU faults automatically.
In Zhao's 2017 study [21], a diagnostic Bayesian network (DBN)-based method was proposed to diagnose 28 faults, which cover most of the common faults in AHUs. Rear-end crashes are one of the most common types of traffic crashes in the U.S., and a good understanding of their characteristics and contributing factors is of practical importance. Both multinomial logit models and Bayesian network methods have previously been used in crash modeling and analysis, although each has its own application restrictions and limitations. In Chen's 2015 study [22], a hybrid approach was developed that combines multinomial logit models and Bayesian network methods to comprehensively analyze driver injury severities in rear-end crashes, based on state-wide crash data collected in New Mexico from 2010 to 2011.
To increase the diagnostic accuracy of ground-source heat pump (GSHP) systems, especially for multiple simultaneous faults, Cai et al. [23] proposed a multi-source information fusion based fault diagnosis methodology using a Bayesian network, since the Bayesian network is considered one of the most useful models for probabilistic knowledge representation and reasoning and handles the uncertainty inherent in fault diagnosis well. In a Bayesian network, the nodes of the graph represent random variables, and a directed edge from one (parent) node to another (child) node represents the relationship between the two variables. The probabilistic relationship between a child node and its parent nodes is represented by a conditional probability table.
The basic idea of the Bayesian network is to use probabilistic methods to deal with uncertainty in real life. It has strong probabilistic reasoning ability and can learn rules from large amounts of seemingly random and irregular data. After the structure and parameters of the Bayesian network have been determined, the model can be used to predict failure under specific input conditions.
One of the most important features of Bayesian networks is their ability to provide a good mathematical model for modeling complex relationships between random variables while maintaining a relatively simple visual presentation. They can be used to describe causal relationships between variables on a strict mathematical basis.
As shown in Figure 4, when C is unknown, A and B are independent; this structure is called head-to-head conditional independence. However, C depends on both random variables A and B. The relationship among them can be expressed as
$$P(A, B, C) = P(A)\,P(B)\,P(C \mid A, B).$$
As shown in Figure 5, when C is given, A and B are independent; this structure is called tail-to-tail conditional independence. Both random variables A and B depend on C, so the relationship among them can be expressed as
$$P(A, B, C) = P(C)\,P(A \mid C)\,P(B \mid C).$$
As shown in Figure 6, when B is given, A and C are independent; this structure is called head-to-tail conditional independence, and the head-to-tail structure is also called a chain network. Variable B depends on variable A, while random variable C depends on variable B. The relationship among them can be expressed as
$$P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid B).$$
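As a small numerical illustration of the chain (head-to-tail) factorization, the following sketch builds the joint distribution from hypothetical binary conditional probability tables and checks that it is a valid distribution:

```python
# Numerical check of P(A, B, C) = P(A) P(B|A) P(C|B) with hypothetical CPTs.
import numpy as np

P_A = np.array([0.6, 0.4])                    # P(A)
P_B_given_A = np.array([[0.7, 0.3],           # rows: A, columns: B
                        [0.2, 0.8]])
P_C_given_B = np.array([[0.9, 0.1],           # rows: B, columns: C
                        [0.5, 0.5]])

# joint distribution via the chain-rule factorization
joint = P_A[:, None, None] * P_B_given_A[:, :, None] * P_C_given_B[None, :, :]
assert np.isclose(joint.sum(), 1.0)           # a valid joint distribution

# marginalizing out A and C recovers P(B)
P_B = joint.sum(axis=(0, 2))
print(P_B)                                    # equals P_A @ P_B_given_A
```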
Any complex Bayesian network can be formed by combining these three basic network structures. The establishment of a Bayesian network is divided into two processes: structure learning and parameter learning. In the structure learning phase, the topological relationships between the variables are determined from the sequential patterns and encoded as a directed acyclic graph. The parameter learning phase involves constructing the conditional probability tables. If the value of every variable is directly observable, the parameters of the network can be obtained directly; when the observations are complete, we use maximum likelihood estimation to obtain the parameters. The log-likelihood function is
$$L(\theta \mid D) = \sum_{t=1}^{M} \log P\big(x^{(t)} \mid \theta\big) = \sum_{t=1}^{M} \sum_{i} \log P\big(x_i^{(t)} \mid \pi(x_i)^{(t)}, \theta\big),$$
where $\pi(x_i)$ represents the parent variables that $x_i$ depends on, $x^{(t)}$ represents the t-th observed value, and M represents the total number of observations.
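For complete observations, maximum likelihood estimation reduces to normalized frequency counting. The following sketch estimates the conditional probability table of one child node from a hypothetical table of fault observations:

```python
# MLE of P(fault_B | fault_A) from complete observations (hypothetical data).
import pandas as pd

data = pd.DataFrame({
    "fault_A": [1, 0, 1, 1, 0, 1, 0, 0],
    "fault_B": [1, 0, 1, 0, 0, 1, 1, 0],
})

# theta_{b|a} = count(A=a, B=b) / count(A=a)
counts = data.groupby(["fault_A", "fault_B"]).size().unstack(fill_value=0)
cpt = counts.div(counts.sum(axis=1), axis=0)
print(cpt)
```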