Gaussian-Linearized Transformer with Tranquilized Time-Series Decomposition Methods for Fault Diagnosis and Forecasting of Methane Gas Sensor Arrays

Zhang, Kai; Ning, Wangze; Zhu, Yudi; Li, Zhuoheng; Wang, Tao; Jiang, Wenkai; Zeng, Min; Yang, Zhi

doi:10.3390/app14010218


by Kai Zhang, Wangze Ning, Yudi Zhu, Zhuoheng Li, Tao Wang, Wenkai Jiang, Min Zeng * and Zhi Yang *

Key Laboratory of Thin Film and Microfabrication (Ministry of Education), Department of Micro/Nano Electronics, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

* Authors to whom correspondence should be addressed.

Appl. Sci. 2024, 14(1), 218; https://doi.org/10.3390/app14010218

Submission received: 12 October 2023 / Revised: 24 November 2023 / Accepted: 27 November 2023 / Published: 26 December 2023

(This article belongs to the Special Issue Recent Advances in Intelligent MEMS Sensors)


Abstract:

Methane is considered a clean energy source and is widely used in places with strict environmental requirements. The growing demand for methane exploration in extreme polar and deep-sea environments supports carbon neutrality policies, so exploration activities for deep-sea methane resources are expected to increase gradually. Methane sensors require high reliability but are prone to faults, so fault diagnosis and forecasting of gas sensors are of vital practical significance. In this work, a Gaussian-linearized transformer model with a tranquilized time-series decomposition method is proposed for fault diagnosis and forecasting tasks. Since the traditional transformer model has a time complexity of O(N²) and is not well suited to continuous-sequence prediction tasks, two blocks of the transformer are improved. First, a Gaussian-linearized attention block is introduced for fault-diagnosis tasks, reducing the time complexity to O(N) and thus the computational resources required. Second, for fault forecasting, the model with the proposed attention replaces the traditional embedding block with a decomposition block, which feeds the continuous sequence data to the model in full and preserves the continuity of the methane data. Results show that the Gaussian-linearized transformer improves the accuracy of fault diagnosis to 99% and performs forecasting at low computational cost, which is superior to traditional methods. Moreover, the lowest mean-square-error loss of fault forecasting is 0.04, which is lower than that of traditional time-series prediction models and other deep learning models, highlighting the great potential of the proposed transformer for fault diagnosis and fault forecasting of gas sensor arrays.

Keywords:

self-attention; transformer; sensor arrays; methane; deep sea

1. Introduction

Methane hydrate, also known as combustible ice, looks like ice and burns when exposed to an open flame [1,2,3]. Countries such as the United States, Japan, and Germany lead the world in mining combustible ice [4,5,6,7]. Its combustion produces ten times more energy than coal, gasoline, and natural gas, making it an ideal energy source for a low-carbon society [8,9,10]. Therefore, methane sensors are important for the exploitation of combustible ice [11]. However, during prolonged underwater methane exploration, the output signal of a methane gas sensor depends not only on gas concentration but also on environmental factors, such as sway, shake, and temperature, and on degradation of the chemical response of the sensitive materials (e.g., heating of wires or oxidation). These factors cause the gas sensor signal to drift and reduce its detection accuracy. Owing to the extensive use of such sensors, failures are inevitable. It is therefore necessary both to distinguish the types of faults and to predict when faults will occur before they happen. Fault diagnosis and forecasting of gas sensors have thus become important issues for their applications [12,13,14].

In general, fault detection methods for gas sensors can be grouped into four kinds: knowledge-based, model-based, data-driven, and hybrid/active [15,16,17]. Because the data-driven method is well suited to complex data, it can accomplish fault diagnosis, and many researchers have adopted it [18,19]. Fault prediction methods used in sensors can be separated into two types: time-series and deep-learning (DL) prediction methods. The first type is based on statistics: after the data are pre-processed by differencing, recursive prediction is performed on the data from front to back. Its disadvantage is that the prediction horizon is short and the accuracy is low. The second type incorporates statistical time-series forecasting methods and can predict long time series with higher accuracy, but it requires a long training time.

Recently, DL [20,21,22] has been used to extract high-dimensional features from fault diagnosis data and classify them directly, avoiding the need for handcrafted features designed by engineers [23,24]. Therefore, a large number of DL methods have been used in fault diagnosis [25,26]. Wen et al. [27] used a convolutional neural network (CNN) for fault diagnosis. Sun et al. [28] used a CNN plus random forest (RF) for sensor array diagnosis. From this previous work, we find that algorithms for fault diagnosis of sensor arrays have been less studied. The transformer is one of the most effective types of DL model in the natural language processing (NLP) field.

Transformer models were proposed by Vaswani et al. (2017) [29] in the context of machine translation [30,31,32,33] and have since been applied to natural language, audio, and images [34,35,36]. However, these tasks often incur extremely high computational and memory costs. The difficulty arises from the global receptive field of self-attention, which processes an input of N elements with quadratic memory and time complexity, O(N²). As a result, transformer models are difficult to train and their context is restricted: the model can only learn from a limited length of time series and cannot capture information over longer spans of the data [37,38,39,40]. Recently, researchers have developed a number of methods to increase the context length without degrading the results. Some have proposed sparse factorizations of the attention matrix, which reduce the self-attention complexity to O(N√N) [41]. Others have decreased the complexity to O(N log N) using locality-sensitive hashing [42], which allows the model to scale to long sequences. Although these models can be trained on long sequences with lower complexity, they do not increase autoregressive inference speed.
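To give a rough sense of what these complexity classes mean in practice, the following sketch counts the dominant number of operations for a sequence of N = 6000 elements. The length is illustrative, constant factors and the feature dimension are ignored, and the function name `attention_cost` is hypothetical, so the figures are indicative only.

```python
import math


def attention_cost(n: int) -> dict:
    """Dominant operation counts (constants ignored) for a length-n sequence."""
    return {
        "full O(N^2)": n * n,
        "sparse O(N sqrt N)": n * math.sqrt(n),
        "LSH O(N log N)": n * math.log2(n),
        "linear O(N)": n,
    }


costs = attention_cost(6000)
for name, ops in costs.items():
    print(f"{name:>20}: {ops:,.0f}")
```

Under these simplifications, full attention needs tens of millions of pairwise terms, whereas a linear mechanism grows only in proportion to the sequence length.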

In this work, we propose a Gaussian-linearized transformer with tranquilized time-series decomposition methods, with a computational complexity of O(N), for fault diagnosis and forecasting tasks of methane sensors. The proposed attention block can automatically capture features of the gas sensor signal through the query matrix and compute the autocorrelation between gas signals at different times through the key and value matrices. Furthermore, the model increases the accuracy of fault diagnosis and forecasting and effectively decreases the training time. The main contributions of the work are as follows:

The traditional self-attention mechanism was changed to reduce the time complexity to O(L), where L is the sequence length. First, inspired by the kernel method, Q and K were passed through the Gelu function to obtain a Gaussian distribution. Second, in contrast to the traditional softmax calculation, we performed a softmax operation on Gelu(Q) and Gelu(K) separately, and then multiplied softmax(Gelu(Q)) by the product of softmax(Gelu(K)) and V to complete the calculation of the Gaussian-linearized attention. As a result, the time complexity of the model was reduced to O(L);
The traditional embedding block was replaced with a tranquilized time-series decomposition block in the fault forecasting task. The decomposition block yields multivariate time-series sequences, which are fed into the Gaussian-linearized transformer model to improve the accuracy of fault forecasting;
The complex fault environment encountered in the actual use of the methane sensor array was taken into account. In practice, the number of sensor array faults is random, and different fault modes may occur at the same time. The data were therefore constructed to reflect these conditions as closely as possible.

2. Theoretical Fundamentals

2.1. Tranquilized Time-Series-Decomposition Embedding

The embedding block (Figure 1) is the first step in the proposed fault forecasting task, and its quality directly affects forecasting accuracy. Traditional embedding methods are discrete and poorly suited to time-series prediction, so they were replaced with tranquilized time-series decomposition embedding in the fault forecasting task. First, the fault forecast data are tranquilized for better prediction in the later stage, and the tranquilized series is then decomposed into trend-cyclical and seasonal Init parts. Next, the tranquilized data are juxtaposed with the trend-cyclical and seasonal Init parts. Finally, the three resulting curves are fed to the encoder as input.

For a length-L input series X ∈ R^{L×D}, we used a differencing method to tranquilize the original fault forecast data:

X_{Tql} = X_{n} - X_{n-1}

(1)

where X_{n} and X_{n-1} denote the sequence values at the current time T_{n} and the previous time T_{n-1}, respectively.

Then, the time-series decomposition process is

X_{t} = \mathrm{AvgMov}(X_{Tql}), \quad X_{s} = X_{Tql} - X_{t}

(2)

where X_{s}, X_{t} \in R^{L \times D} denote the seasonal and the extracted trend-cyclical parts, respectively. The moving average is

\mathrm{AvgMov}(X_{Tql}) = \frac{1}{m} \sum_{j=-k}^{k} X_{Tql+j}, \quad m = 2k + 1

(3)

The trend-cycle at time t is estimated by averaging the values of the time series within k periods on either side of t, i.e., over a window of m = 2k + 1 points. Averaging removes much of the randomness in the data and leaves the smooth trend component. The final embedding juxtaposes the three series:

X_{Embedding} = [X_{Tql}, X_{t}, X_{s}]

(4)

where X_{Embedding} \in R^{L \times 3} denotes our final embedding output.

The main aspects of the method are described by Algorithm 1.

Algorithm 1 Continuous sequence decomposition embedding.
function	embedding(X_i)
|	X_Tql ← tranquilize the raw time-series sequence X_i	Equation (1)
|	X_t = [], X_s = []
|	for each time step n do
|	|	x_t ← (1/m) Σ_{j=−k}^{k} X_Tql[n + j], m = 2k + 1	Equation (3)
|	|	x_s ← X_Tql[n] − x_t	Equation (2)
|	|	X_t.append(x_t), X_s.append(x_s)
|	end
|	return X_Embedding ← [X_Tql, X_t, X_s]	Equation (4)
end
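Algorithm 1 and Equations (1) to (4) can be sketched in plain Python for a single-channel series (D = 1). This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the boundary handling of the moving average (repeating the first and last values) is an assumption, since the text does not specify it. Note also that first-order differencing as written shortens the series by one point.

```python
def difference(x):
    """Equation (1): first-order differencing, X_Tql[n] = X[n] - X[n-1]."""
    return [x[n] - x[n - 1] for n in range(1, len(x))]


def moving_average(x, k):
    """Equation (3): centered moving average over a window of m = 2k + 1.

    Assumption: the ends are padded by repeating the first/last value so the
    output keeps the input length; the paper does not state its padding.
    """
    m = 2 * k + 1
    padded = [x[0]] * k + list(x) + [x[-1]] * k
    return [sum(padded[n:n + m]) / m for n in range(len(x))]


def embedding(x, k=1):
    """Equations (2) and (4): stack [X_Tql, X_t, X_s] row by row."""
    x_tql = difference(x)                          # tranquilized series
    x_t = moving_average(x_tql, k)                 # trend-cyclical part
    x_s = [a - b for a, b in zip(x_tql, x_t)]      # seasonal part
    return [list(row) for row in zip(x_tql, x_t, x_s)]


rows = embedding([1.0, 2.0, 4.0, 7.0, 11.0], k=1)
for row in rows:
    print(row)
```

Each output row holds the tranquilized value, its trend estimate, and the seasonal remainder, so the three columns correspond to the three curves passed to the encoder.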

2.2. Gaussian-Linearized Attention

The kernel method is used to solve nonlinear problems, which are typical of complex data. Specifically, it transforms a nonlinear problem into a linear one that is easier to solve: the nonlinear data are mapped into a high-dimensional space in which they become linearly separable. However, computing directly in the high-dimensional space is still difficult, so the alternative is to compute a similarity measure in the feature space instead of the coordinates of the vectors, and then apply an algorithm that needs only this measure. The similarity measure is represented by a dot product, and the kernel can be written as K(x_i, y_j).

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The standard scaled dot-product attention, written in matrix form, is

E' = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{D}}\right) V

(5)

Equation (5) applies a softmax, which amounts to taking the exponential of the dot product between each query and key and normalizing over the keys. Given that subscripting a matrix with i returns the i-th row as a vector, we can write a generalized attention equation for any similarity function as follows:

E_{i}' = \frac{\sum_{j=1}^{N} \mathrm{sim}(q_{i}, k_{j}) v_{j}}{\sum_{j=1}^{N} \mathrm{sim}(q_{i}, k_{j})}, \quad \mathrm{sim}(q, k) = \exp\left(\frac{q^{T} k}{\sqrt{D}}\right)

(6)
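As a small numerical check of the relationship between Equations (5) and (6), the sketch below computes one output row both ways and compares them. The toy values of Q, K, and V and all function names are illustrative assumptions, not data or code from the paper.

```python
import math


def softmax_attention_row(q_i, K, V):
    """Row i of Equation (5): softmax(q_i K^T / sqrt(D)) V."""
    d = len(q_i)
    scores = [sum(a * b for a, b in zip(q_i, k)) / math.sqrt(d) for k in K]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]


def generalized_attention_row(q_i, K, V, sim):
    """Row i of Equation (6) for any non-negative similarity function sim."""
    sims = [sim(q_i, k) for k in K]
    denom = sum(sims)
    return [sum(s * v[c] for s, v in zip(sims, V)) / denom for c in range(len(V[0]))]


def exp_sim(q, k):
    """sim(q, k) = exp(q^T k / sqrt(D)), which recovers softmax attention."""
    return math.exp(sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)))


Q = [[0.1, 0.2], [0.3, -0.1]]
K = [[0.5, 0.4], [-0.2, 0.1], [0.0, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

row_softmax = softmax_attention_row(Q[0], K, V)
row_general = generalized_attention_row(Q[0], K, V, exp_sim)
print(row_softmax, row_general)
```

The two rows agree up to floating-point rounding, which is why any other non-negative similarity can be substituted for `exp_sim` to obtain a different attention variant.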

If sim(·,·) is non-negative, Equation (6) can be generalized to other forms of attention. When the similarity is expressed through feature maps \phi(x) and φ(x) as \mathrm{sim}(q, k) = \phi(q)^{T} φ(k), and \phi(x) and φ(x) are equal, it can be defined as kernel attention. Equation (6) can then be written as follows:

E_{i}'' = \frac{\sum_{j=1}^{N} \phi(q_{i})^{T} φ(k_{j}) v_{j}}{\sum_{j=1}^{N} \phi(q_{i})^{T} φ(k_{j})}

(7)

Equation (7) was simplified by the associative property of matrix multiplication to

E_{i}''' = \frac{\phi(q_{i})^{T} \sum_{j=1}^{N} φ(k_{j}) v_{j}}{\phi(q_{i})^{T} \sum_{j=1}^{N} φ(k_{j})}

(8)

When the numerator of Equation (8) is written in vectorized form, the order of multiplication can be rearranged as follows:

\left(\phi(Q) φ(K)^{T}\right) V = \phi(Q) \left(φ(K)^{T} V\right)

(9)
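The reordering in Equation (9) is what removes the N × N intermediate matrix. The sketch below checks it on small integer matrices (N = 4, D = 2), standing in for already feature-mapped queries and keys; the values and helper names are illustrative assumptions.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


# Stand-ins for phi(Q) and phi(K) with N = 4 positions and D = 2 features.
phi_Q = [[1, 2], [0, 1], [3, 1], [2, 2]]
phi_K = [[1, 0], [2, 1], [0, 3], [1, 1]]
V = [[1, 2], [3, 4], [5, 6], [7, 8]]

scores = matmul(phi_Q, transpose(phi_K))   # N x N intermediate (quadratic in N)
left = matmul(scores, V)

context = matmul(transpose(phi_K), V)      # D x D intermediate (independent of N)
right = matmul(phi_Q, context)

print(left == right, len(scores), len(context))
```

Both orders give the same result, but the right-hand side only ever stores a D × D matrix, so its cost grows linearly with the sequence length for fixed D.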

To ensure that \phi(x) and φ(x) are non-negative, we normalized the query and key in the above attention. A kernel is defined as a function that takes vectors in the original space as input and returns the dot product of the corresponding vectors in the feature space (after the data have been transformed, possibly to higher dimensions). Using the kernel method, the query and the key were passed through the Gelu function, which takes the place of \phi(x), with \phi(x) = φ(x):

\phi(x) = φ(x) = \mathrm{Gelu}(x)

(10)

where Gelu(x) denotes the Gaussian error linear unit activation function. It is a high-performance neural network activation function: it weights its input x by \Phi(x), the cumulative distribution function of the standard normal distribution, which provides a smooth nonlinearity well suited to neural network training.

\mathrm{Gelu}(x) = x P(X \leq x) = x \Phi(x) = x \cdot \frac{1}{2} \left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]

(11)

We can approximate the Gelu function with 0.5 x (1 + \tanh[\sqrt{2/π} (x + 0.044715 x^{3})]) or x σ(1.702 x).
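For reference, the exact Gelu of Equation (11) and the two approximations can be compared numerically with the standard library alone. The function names are illustrative, and σ is taken to be the logistic sigmoid.

```python
import math


def gelu(x):
    """Exact Gelu, Equation (11): x * Phi(x) using the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh-based approximation."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_sigmoid(x):
    """Sigmoid-based approximation: x * sigma(1.702 x)."""
    return x / (1.0 + math.exp(-1.702 * x))


grid = [i / 10.0 for i in range(-40, 41)]
max_err_tanh = max(abs(gelu_tanh(x) - gelu(x)) for x in grid)
max_err_sigmoid = max(abs(gelu_sigmoid(x) - gelu(x)) for x in grid)
print(f"max |tanh approx - exact|    = {max_err_tanh:.5f}")
print(f"max |sigmoid approx - exact| = {max_err_sigmoid:.5f}")
```

On this grid the tanh form tracks the exact function more closely than the sigmoid form, which is cheaper to evaluate.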

The following equation characterizes our final Gaussian distribution attention mechanism, and the module is a concrete implementation of the mechanism for computer vision data:

E''''(Q, K, V) = δ_{q}(\phi(Q)) \left(δ_{k}(φ(K))^{T} V\right)

(12)

where δ_{q} and δ_{k} are normalization functions for the query and key features, respectively. The two normalization methods are implemented as

\mathrm{Scaling}: δ_{q}(X) = δ_{k}(X) = \frac{X}{\sqrt{n}}
\mathrm{Softmax}: δ_{q}(X) = ω_{row}(X), \quad δ_{k}(X) = ω_{col}(X)

(13)

where ω_{row} and ω_{col} denote applying the softmax function along each row and each column of the matrix X, respectively.

With softmax normalization, Equation (12) can be written as follows:

E'''' = \mathrm{softmax}_{2}(\phi(Q)) \left(\mathrm{softmax}_{1}(φ(K))^{T} V\right)

(14)

The proposed Gaussian-linearized attention effectively reduces the time complexity of the model to O(N) and makes the data closer to a Gaussian distribution, which is conducive to model training. The main aspects of the method are described in Algorithm 2.

Algorithm 2 Gaussian-linearized attention.
function	Attention(X_embedding)
|	Q, K, V ← X_embedding
|	\phi(x), φ(x) ← Gelu(x)	Equation (10)
|	S ← 0
|	for j from 1 to N do
|	|	S ← S + δ_{k}(φ(K_{j}))^{T} V_{j}
|	end
|	for i from 1 to N do
|	|	E_{i}'''' ← δ_{q}(\phi(Q_{i})) · S	Equation (12)
|	end
|	return E''''	Equation (14)
end
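A minimal single-head sketch of Equation (14) in plain Python is given below. It is one reading of the method under stated assumptions, not the authors' code: Q, K, and V are taken as already projected, the row softmax is applied over the D features of each query, the column softmax is applied over the N positions of each key feature, and all names and toy values are hypothetical.

```python
import math


def gelu(x):
    """Exact Gelu, Equation (11)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gaussian_linearized_attention(Q, K, V):
    # delta_q: softmax along each row of Gelu(Q), i.e. over the D features.
    q_feat = [softmax([gelu(x) for x in row]) for row in Q]           # N x D
    # delta_k: softmax along each column of Gelu(K), i.e. over N positions.
    # Iterating over zip(*K) yields columns, so this is already transposed.
    k_feat_t = [softmax([gelu(x) for x in col]) for col in zip(*K)]   # D x N
    context = matmul(k_feat_t, V)     # D x D_v, built once in O(N * D * D_v)
    return matmul(q_feat, context)    # N x D_v, no N x N matrix is formed


Q = [[0.2, -0.5, 1.0], [1.5, 0.3, -0.7], [0.0, 0.8, 0.4], [-1.2, 0.6, 0.9]]
K = [[0.1, 0.4, -0.3], [0.9, -0.2, 0.5], [-0.6, 1.1, 0.0], [0.3, 0.3, 0.7]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, -1.0]]
out = gaussian_linearized_attention(Q, K, V)
const = gaussian_linearized_attention(Q, K, [[1.0, 2.0]] * 4)
for row in out:
    print(row)
```

Because both softmax steps produce weights that sum to one, every output row is a convex combination of the rows of V; with identical value rows (`const`), the attention returns that row unchanged.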

In the Supplementary Materials (SM), the Gaussian-linearized attention is shown to be theoretically equivalent to traditional self-attention. The architectures of dot-product and Gaussian-linearized attention are shown in Figure 2. Moreover, in Section 3 of the SM, the necessity of reducing the time complexity of the model to O(N) is demonstrated by comparing forecasting data lengths.

2.3. Encoder Stacks

The encoder layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of the same dimension. The encoder block is shown in Figure S1.
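The residual pattern LayerNorm(x + Sublayer(x)) can be illustrated for a single feature vector as follows. This is a simplified sketch: the learnable gain and bias of layer normalization are omitted, the `eps` value is an assumed default, and the doubling function stands in for an attention or feed-forward sub-layer.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance.

    Assumption: no learnable scale or shift, and eps = 1e-5.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one position."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


# A stand-in sub-layer; it must return a vector of the same length as x.
out = residual_sublayer([1.0, 2.0, 3.0], lambda v: [2.0 * t for t in v])
print(out)
```

The element-wise sum only works when the sub-layer preserves the input dimension, which is why the sub-layers and the embedding share one output size.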

The parameters used in the fault-diagnosis task are shown in Table S2. Because the fault data in this task are more complex than those of the fault-prediction task, sufficient attention is needed to learn the correlation between the data of each fault type during training, so the number of heads was increased to 4. In the middle layer, the classical transformer model was used without any changes. Dropout was applied in Layer 3 and Layer 4 to prevent the model from overfitting during training, which would result in poor performance on the test set.

The parameters used in the fault-prediction task are shown in Table S3. The traditional embedding layer was replaced with tranquilized time-series decomposition embedding, which linearized the structure, and the prediction data formed a matrix of shape [L, 3]. Therefore, the value of d_Model could only be 3. Considering that the predicted data were not particularly complex, the number of encoder layers was changed from N to 1 and the number of heads was kept at 3, so that the model could learn the correlation between the fault data and its trend and seasonal values. The traditional transformer settings were kept for the hidden layer, with a d_ff value of 128. Similarly, dropout was used in Layer 3 and Layer 4 to prevent overfitting, which would lead to low performance.

3. Experiment and Validation of the Proposed Method

3.1. Experiment Setup

All data in this paper are from a public dataset. A system diagram of the methane sensor arrays is shown in Figure 3a. The experimental system is mainly composed of six parts: the gas circuit, the gas chamber, the environmental stress simulator, the fault transmitter, the communication module, and the host computer. The gas circuit includes a standard air bag; the gas chamber includes five parts: sensor array, environmental sensor, humidifier, heater, and condensation device. The environmental stress simulator includes a vibration generator and a sway generator.

The fault transmitter includes a program-controlled relay, program-controlled potentiometer, and program-controlled voltage device. The communication module realizes two-way communication and data transmission with the host computer. The host computer is the main platform of the test system, which is responsible for sending down control commands and receiving all kinds of temperature, humidity, air pressure, vibration, sway, and gas concentration detection information from the communication module, collecting sensor array output signals, extracting characteristic information, and processing fault data. The structure of the sensor signal pickup circuit is shown in Figure 3b. The MQ-6 gas sensor cylinder core structure is shown in Figure 3c. The program was run on a 2.8-GHz Intel CPU with 16 GB of RAM running Windows 10.

In the fault-diagnosis task, the original curves of the sensor fault data are shown in Figure 4. A total of 9 primary failure types and the methane background gas were tested, i.e., heating wire disconnection irregularly (HWDI) fault, heating wire disconnection (HWD) fault, exfoliation of sensitive body irregularly (ESBI) fault, exfoliation of sensitive body (ESB) fault, HWDI+HWD, ESBI+ESB, ESBI+HWD, offset fault (OF), partial exfoliation of the sensitive body (PESB), and normal.

The methane gas curves were combined with six fault models, as shown in Table S1. There are six fault-diagnosis patterns, all of which are based on methane gas. Under real working conditions, the failure types of a methane sensor array are extremely complex: faults rarely occur singly, and different sensors may fail in different ways at the same time. Therefore, we added different faults to each pattern to simulate actual conditions. Our six fault-diagnosis models are designed to be as complex as real conditions, so the task can be regarded as equivalent to fault diagnosis in a real situation. Noise and sway information were also added to the background state to simulate the complex environment. There were 37,200 pieces of training and test data for each failure mode. The data of the sensor fault-diagnosis task are shown in Figure S2, from which it can be seen that the data differ slightly between models. The correlation between sensors in the fault-diagnosis task is shown in Figure S3. Each sensor has a certain similarity with the other sensors, which generally occurs in real situations and increases the difficulty of sensor fault diagnosis.

Regarding the fault-prediction task, detecting methane sensor failure at an early stage and replacing the sensor in time improves work efficiency and saves costs under actual working conditions. The data are derived from the sensor ESBI situation: as the failure develops, the failure time increases and the interval decreases until an ESB occurs. The data are composed of 6000 points, and the first peak is the early warning of failure under actual working conditions, which alerts the staff to the occurrence of the sensor fault. The data of the fault forecasting task are plotted in Figure 5.

3.2. The Flowchart of the Fault Diagnosis Process

The Gaussian-linearized transformer model is developed as an intelligent model for fault diagnosis based on multi-sensor data under variable operating conditions. The flowchart of the fault diagnosis process is shown in Figure 6, and the diagnostic process is summarized as follows:

Step 1: The fault diagnosis is based on methane-gas signals collected under different conditions, and the signals are appropriately pre-processed to generate training and testing datasets. A 10-fold cross-validation method was used to separate training and testing data;

Step 2: A fault-diagnosis system combining multi-sensor signals was established based on the Gaussian-linearized transformer model. The training samples from Step 1 were fed into the model and trained offline through multiple iterations to achieve the extraction and fusion of multi-sensor features and fault classification;

Step 3: The test samples generated by the pre-processing in Step 1 were then input into the trained model to directly diagnose faults of the methane sensor array under different conditions.

3.3. Validation of Fault Diagnosis Method and Inference

The training and test datasets consisted of 37.2 k data points. Validation used the six fault types listed in Table S1. We adopted a 10-fold approach in training, taking nine parts as the training set and one part as the test set. The Gaussian-linearized transformer model was trained for 100 iterations with a batch size of 50 and an initial learning rate of 0.01. The learning rate was dynamic: as the number of training epochs increased, it decreased by 0.0005 every 20 epochs to help the model converge faster. The results are shown in Table 1. The test accuracy reached 99.75%, with a test time of 0.01 s and a training time of 480 s.
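The training schedule described above can be written down as a short sketch. The step-decay function follows the stated numbers (0.01 initially, minus 0.0005 every 20 epochs). The fold construction is an assumption: contiguous folds without shuffling, with any remainder left in the training part, since the text does not describe how the folds were drawn.

```python
def learning_rate(epoch, base=0.01, step=0.0005, every=20):
    """Step decay: subtract `step` from `base` once every `every` epochs."""
    return base - step * (epoch // every)


def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Assumption: no shuffling; each fold holds n_samples // k test samples.
    """
    fold = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        test = indices[i * fold:(i + 1) * fold]
        train = indices[:i * fold] + indices[(i + 1) * fold:]
        yield train, test


schedule = [learning_rate(epoch) for epoch in range(100)]
splits = list(k_fold_splits(37200, k=10))
print(schedule[0], schedule[20], schedule[99])
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With 37,200 samples, each fold under this assumption trains on 33,480 samples and tests on 3720.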

To evaluate the performance of the proposed method, other traditional methods were selected to compare prediction accuracy.

The selected methods were CNN+RF, the traditional transformer encoder (Transformer_En), and the Gaussian-linearized transformer (GLTrans.). The comparison results are shown in Table 1. CNN+RF had the lowest training time (300 s) but also low accuracy (96%), while Transformer_En required 600 epochs and reached 98% accuracy.

3.4. Fault Diagnosis Task Results by Confusion Matrix

As shown in Figure 7, the classification accuracy of sensor fault diagnosis reaches 99.75%. The confusion matrix shows that the accuracy for all six modes is very high and that there is essentially no confusion between modes. This indicates that the model fully learns the abstract features of the different patterns; the high quality of the data is also one of the reasons for the good experimental results.

3.5. Validation of Fault Forecast Method and Inference

The proposed method was trained with the ADAM optimizer with an initial learning rate of 0.005, which decreased by 0.0005 every 10 epochs. The batch size was set to 20. Three fault-prediction experiments were conducted: the training period was {900, 3700}, and the prediction periods were {2900, 4700} to predict 1000 points, {3000, 4800} to predict 1100 points, and {3100, 4800} to predict 1200 points. The results of the model are shown in Table 2. Considering the sequential nature of time series, we separated the training set and testing set on the timeline, with the training set first and the testing set after it. In addition, the input values (model input x) precede the output values (model output y) on the timeline, which is in line with the nature of prediction tasks: prediction generally involves analyzing historical events to infer future events.

Several traditional time-series models and DL models suitable for prediction tasks were chosen as comparison models, i.e., autoregressive integrated moving average (ARIMA), generalized autoregressive conditional heteroscedasticity (GARCH), deep autoregressive (DeepAR), DeepState, and attention-based long short-term memory (AT-LSTM). In the experiments, the traditional time-series models could only predict a short horizon and did not perform well in the long-series fault-prediction task. By comparison, the mean-square-error loss (MSELoss) of the Gaussian-linearized transformer was approximately 0.04. Among the DL comparison models, AT-LSTM performed the best, with an MSELoss of 0.07. This is because the attention mechanism added to the LSTM model makes it easier to learn important relevant information during training.
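For clarity about the metrics quoted here, the mean square error and mean absolute error can be computed as in the sketch below. The sample values are made up for illustration and are unrelated to the experimental data.

```python
def mse(y_true, y_pred):
    """Mean square error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only.
target = [1.0, 2.0, 3.0]
forecast = [1.0, 2.0, 5.0]
print(mse(target, forecast), mae(target, forecast))
```

Because the error is squared, MSE penalizes a single large miss more heavily than MAE does, so the two losses are best read together.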

3.6. Visualization of Fault Forecast Task Results

Figure 8a–c shows the sensor fault prediction results for 1000, 1100, and 1200 points, respectively. As seen in Figure 8, the prediction results of the traditional statistical models are relatively poor, followed by DeepAR and DeepState, which combine statistical models with neural networks. Because AT-LSTM combines the attention mechanism with the LSTM model, which is suitable for prediction tasks, its prediction performance is the best among the comparison models.

3.7. Attention Visualization for Fault Forecast Task Training Process

The heat map of the attention mechanism during training on the prediction task is shown in Figure 9, which presents the attention heat maps after 10, 20, 50, and 100 training epochs, respectively. As seen in Figure 9, the attention pattern changes as the number of training epochs increases.

3.8. Contrast of Memory Cost with Different Models

Several related approaches to reducing time complexity were chosen for comparison, as shown in Table 3. Three comparison models were employed, i.e., Trans.-XL [43], Sparse Trans. [44], and Reformer [45]. The table shows that previous work reduced the time complexity to at best O(N log N), while the current work, GLTrans., used kernel RC and the low-rank method to reduce the time complexity to O(N) without loss of model performance.

4. Conclusions

In conclusion, a novel transformer with a tranquilized time-series decomposition method was proposed for the fault diagnosis and fault forecasting of sensor arrays. Gaussian-linearized attention was used as the attention module of the proposed transformer (GLTrans.); it normalizes both query and key before the dot product and requires only O(N) memory in the fault forecasting task, thereby improving forecasting accuracy. The accuracy of fault diagnosis reached 99% with the proposed method, which is superior to that of AT-LSTM and other methods. Both the mean-square-error loss (0.04) and the mean absolute error loss (0.15) of fault forecasting using the proposed approach are very low compared with traditional time-series prediction models and other deep-learning models. The proposed GLTrans. model has only O(N) time complexity, compared with the O(N²) time complexity of the original transformer model. Results show that the Gaussian-linearized transformer model provides a good solution for fault forecasting.

In future experiments, we will test sensor failure and predict the time of failure in more complex engineering environments.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app14010218/s1.

Author Contributions

Conceptualization, K.Z., W.N. and Y.Z.; formal analysis, Z.L. and T.W.; methodology, W.J.; writing—original draft preparation, K.Z.; writing—review and editing, K.Z., M.Z. and Z.Y.; funding acquisition, M.Z. and Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (2022YFC3104700), the National Natural Science Foundation of China (62371299, 62301314, and 62101329), the China Postdoctoral Science Foundation (2023M732198), the Oceanic Interdisciplinary Program of Shanghai Jiao Tong University (SL2020ZD203, SL2021MS006 and SL2020MS031), the Scientific Research Fund of Second Institute of Oceanography, Ministry of Natural Resources of China (SL2003), and the Startup Fund for Youngman Research at Shanghai Jiao Tong University. We also acknowledge analysis support from the Instrumental Analysis Center of Shanghai Jiao Tong University and the Center for Advanced Electronic Materials and Devices of Shanghai Jiao Tong University. The computations in this paper were run on the π 2.0 cluster supported by the Center for High Performance Computing at Shanghai Jiao Tong University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

For privacy reasons, given the sensitive nature of the data, the aggregated data analyzed in this study will not be publicly disclosed but might be available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Hyo, L.; Jong, S.; Sang, B. Alternative risk assessment for dangerous chemicals in Republic of Korea regulation: Comparing three modeling programs. Int. J. Environ. Res. Public Health 2018, 15, 1600.
2. George, F.; Ludmila, O.; Nelly, M. Semiconductor gas sensors based on Pd/SnO₂ nanomaterials for methane detection in air. Nanoscale Res. Lett. 2017, 12, 329.
3. Cedric, B.; Matthew, M.; Douglas, P. A novel low-cost high performance dissolved methane sensor for aqueous environments. Opt. Express 2008, 16, 12607–12617.
4. Fukasawa, T.; Hozumi, S. Dissolved methane sensor for methane leakage monitoring in methane hydrate production. In OCEANS 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 15, pp. 1–6.
5. Boulart, C.; Connelly, D.P.; Mowlem, M.C. Sensors and technologies for in situ dissolved methane measurements and their evaluation using technology readiness levels. TrAC Trends Anal. Chem. 2010, 29, 186–195.
6. Lamontagne, A.; Rose, S. Response of METs Sensor to Methane Concentrations Found on the Texas-Louisiana Shelf in the Gulf of Mexico; Naval Research Laboratory: Washington, DC, USA, 2001; Volume 15, pp. 1–10.
7. Ke, W.; Svartaas, T.M.; Chen, D. A review of gas hydrate nucleation theories and growth models. J. Nat. Gas Sci. Eng. 2019, 61, 169–196.
8. Sun, Y.; Zhao, H. In-situ detection of ocean floor seawater and gas hydrate exploration of the South China Sea. Earth Sci. Front. 2017, 24, 225–241.
9. Li-fu, Z.; Kang, Q. The development of in situ detection technology and device for dissolved methane and carbon dioxide in deep sea. Mar. Geol. Front. 2022, 38, 1–18.
10. Xijie, Y.; Huaiyang, Z. The evidence for the existence of methane seepages in the northern South China Sea: Abnormal high methane concentration in bottom water. Acta Oceanol. Sin. 2008, 30, 69–75.
11. Jia-ye, Z.; Xian-qin, W. The dissolved methane in seawater of estuaries, distribution features and formation. J. Oceanogr. Huanghai Bohai Seas 1997, 15, 20–29.
12. Chen, Y.S.; Xu, Y.H. Fault detection, isolation, and diagnosis of status self-validating gas sensor arrays. Rev. Sci. Instrum. 2010, 87, 045001.
13. Sana, J.; Young, L.; Jungpil, S. Sensor fault classification based on support vector machine and statistical time-domain features. IEEE Access 2017, 5, 8682–8690.
14. Yang, J.; Chen, Y. An efficient approach for fault detection, isolation, and data recovery of self-validating multifunctional sensors. IEEE Trans. Instrum. Meas. 2017, 66, 543–558.
15. Zhi-wei, G.; Carlo, C. A survey of fault diagnosis and fault tolerant techniques—Part I: Fault diagnosis with model-based and signal-based approaches. IEEE Trans. Ind. Electron. 2015, 62, 3757–3767.
16. Lu, J.; Huang, J.; Lu, F. Sensor fault diagnosis for aero engine based on online sequential extreme learning machine with memory principle. Energies 2017, 10, 39.
17. Fang, D.; Su, G.; Rui, Z. Sensor multi-fault diagnosis with improved support vector machines. IEEE Trans. Autom. Sci. Eng. 2017, 14, 1053–1063.
18. Li, H.; Meng, Q. Fault identification of hydroelectric sets based on time-frequency diagram and convolutional neural network. In Proceedings of the 2019 IEEE 8th International Conference on Advanced Power System Automation and Protection (APAP), Xi’an, China, 21–24 October 2019.
19. Qing, L.; Huang, H. Comparative study of probabilistic neural network and back propagation network for fault diagnosis of refrigeration systems. Sci. Technol. Built Environ. 2018, 24, 448–457.
20. Shao, H.; Jiang, H.; Zhang, X.; Niu, M. Rolling bearing fault diagnosis using an optimization deep belief network. Meas. Sci. Technol. 2015, 26, 115002.
21. Long, W.; Liang, G. A new deep transfer learning based on sparse auto-encoder for fault diagnosis. IEEE Trans. Syst. Man Cybern. Syst. 2019, 49, 136–144.
22. He, W.; Qiao, P.L. A new belief-rule-based method for fault diagnosis of wireless sensor network. IEEE Access 2018, 6, 9404–9419.
23. Ma, S.; Chu, F. Ensemble deep learning-based fault diagnosis of rotor bearing systems. Comput. Ind. 2019, 105, 143–152.
24. Zhan, Z.; Hua, H. Novel application of multi-model ensemble learning for fault diagnosis in refrigeration systems. Appl. Therm. Eng. 2020, 164, 114516.
25. Wang, T.; Li, Q. Transformer fault diagnosis method based on incomplete data and TPE-XGBoost. Appl. Sci. 2023, 13, 7539.
26. Shi, L.; Su, S.; Wang, W.; Gao, S.; Chu, C. Bearing fault diagnosis method based on deep learning and health state division. Appl. Sci. 2023, 13, 7424.
27. Wen, L.; Li, X.; Gao, L.; Zhang, Y. A new convolutional neural network-based data-driven fault diagnosis method. IEEE Trans. Ind. Electron. 2018, 65, 5990–5998.
28. Sun, Y.; Zhang, H. A new convolutional neural network with random forest method for hydrogen sensor fault diagnosis. IEEE Access 2020, 8, 85421–85430.
29. Ashish, V.; Noam, S.; Niki, P.; Jakob, U. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
30. Bahdanau, D.; Cho, K. Neural machine translation by jointly learning to align and translate. arXiv 2016, arXiv:1409.0473.
31. Lin-hao, D.; Shuang, X.; Bo, X. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5884–5888.
32. Tom, B.; Benjamin, M. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
33. Ze, L.; Yu-tong, L. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
34. Rami, A.; Dokook, C.; Noah, C. Character-level language modeling with deeper self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, p. 011920.
35. Zi, D.; Zhi, Y. Transformer-XL: Language modeling with longer-term dependency. In Proceedings of the ICLR, New Orleans, LA, USA, 6–9 May 2019.
36. Parmar, N.; Vaswani, A. Image transformer. Proc. Mach. Learn. Res. 2018, 80, 4055–4064.
37. Wilson, A.G.; Hu, Z. Deep kernel learning. Proc. Mach. Learn. Res. 2021, 51, 370–378.
38. Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Proceedings of the NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019.
39. Wang, S.; Li, B. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768.
40. Xiong, Y.; Zeng, Z. Nystromformer: A nystrom-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, p. 16.
41. Yi, T.; Mostafa, D. Efficient transformer: A Survey. ACM J. 2020, 55, 6.
42. Yi, T.; Mostafa, D. Long Range Arena: A Benchmark for Efficient Transformers. ICLR 2021, 23, 1022–1032.
43. Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv 2019, arXiv:1901.02860.
44. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509.
45. Nikita, K.; Lukasz, K.; Anselm, L. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451.

Figure 1. Embedding of fault forecast task.

Figure 2. Illustration of the architecture of dot-product and Gaussian-linearized attention.

Figure 3. (a) Experimental system diagram for the methane gas sensor array; (b) gas sensor circuit diagram; (c) photographs of the methane gas sensor.

Figure 4. Original signals of the different fault types and of the normal methane response: (a) HWD fault; (b) HWDI fault; (c) OF fault; (d) ESB fault; (e) ESBI fault; (f) PESB fault; (g) ESBI + ESB fault; (h) HWDI + HWD fault; (i) ESB + HWD fault; (j) normal signal.

Figure 5. Data for the fault forecast task: (a) original signal; (b) trend component; (c) seasonal component.

Figure 6. Flowchart of the fault diagnosis process.

Figure 7. Fault diagnosis task results by confusion matrix.

Figure 8. Visualization of fault forecast task results: (a) 1000-point prediction; (b) 1100-point prediction; (c) 1200-point prediction.

Figure 9. Attention visualization for the fault forecast task with prediction period I = {2900, 4700} and prediction length O = 1000 points: (a) epoch 10, MSE 0.15, MAE 0.34; (b) epoch 20, MSE 0.08, MAE 0.26; (c) epoch 50, MSE 0.05, MAE 0.2; (d) epoch 100, MSE 0.04, MAE 0.15.

Table 1. Diagnosis accuracy based on different methods.

Model	Training Time (s)	Accuracy	Recall	Precision	F1 Score (0, 1)	Testing Time (s)
CNN+RF	600	96%	96%	96.84%	96.45%	0.02
Transformer_En	600	98%	98%	98.76%	98.56%	0.02
GLTrans.	480	99.75%	99.75%	99.99%	99.86%	0.01

Table 2. Results of the fault forecast task for the methane sensor. Each column pairs a prediction period I with a prediction length O (in points).

Model	Loss	I = {2900, 4700}, O = 1000	I = {3000, 4800}, O = 1100	I = {3100, 4900}, O = 1200
ARIMA	MSE	3.75	4.96	8.8
ARIMA	MAE	1.56	1.89	2.36
GARCH	MSE	0.85	1.48	2.66
GARCH	MAE	1.03	1.21	1.31
DeepAR	MSE	0.09	0.14	0.18
DeepAR	MAE	0.21	0.28	0.3
DeepState	MSE	0.08	0.1	0.13
DeepState	MAE	0.21	0.25	0.27
AT-LSTM	MSE	0.07	0.09	0.11
AT-LSTM	MAE	0.2	0.23	0.26
GLTrans.	MSE	0.04	0.08	0.10
GLTrans.	MAE	0.15	0.2	0.25
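
The MSE and MAE losses in Table 2 can be read as the mean squared error and mean absolute error between forecast and measured values. The following is a minimal sketch assuming those standard definitions; the function names and the toy values are hypothetical and are not taken from the study's data.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy values, not data from the paper.
actual = [1.0, 2.0, 3.0, 4.0]
forecast = [1.5, 2.0, 2.0, 4.5]

print(mse(actual, forecast))  # 0.375
print(mae(actual, forecast))  # 0.5
```

Because MSE squares each error, a single large deviation raises it far more than it raises MAE, which is one reason to report both.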

Table 3. Summary of efficient transformer models. Class abbreviations include: FP = Fixed Patterns or Combinations of Fixed Patterns, LP = Learnable Pattern, LR = Low Rank, KR = Kernel, and RC = Recurrence.

Models	Complexity	Decode	Class
Trans.-XL (Dai et al., 2019) [43]	O(n²)	✓	RC
Sparse Trans. (Child et al., 2019) [44]	O(n√n)	✓	FP
Reformer (Kitaev et al., 2020) [45]	O(n log n)	✓	LP
GLTrans.	O(n)	✗	KR+LR
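
The O(n) entry for the kernel class in Table 3 follows from a general property of kernelized attention. If the similarity between a query and a key can be written as a product of feature maps, the sum over keys can be computed once and reused for every query, so the n × n attention matrix is never built. The sketch below illustrates this for non-causal (encoder-style) attention. It is an illustration under stated assumptions, not the paper's implementation: the feature map `elu(x) + 1` is a common stand-in from the linear-attention literature and is not the Gaussian map used by GLTrans., and all function names and toy values are hypothetical.

```python
import math


def feature_map(x):
    """elu(x) + 1, a positive feature map (stand-in, not the paper's Gaussian map)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def quadratic_attention(Q, K, V):
    """Reference form: computes all n*n similarities explicitly."""
    phi_k = [feature_map(k) for k in K]
    out = []
    for q in Q:
        phi_q = feature_map(q)
        weights = [sum(a * b for a, b in zip(phi_q, k)) for k in phi_k]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same result via phi(Q) (phi(K)^T V): cost grows linearly with n."""
    d, d_v = len(Q[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]  # running sum of phi(k) v^T
    k_sum = [0.0] * d                     # running sum of phi(k)
    for k, v in zip(K, V):
        phi_k = feature_map(k)
        for i in range(d):
            k_sum[i] += phi_k[i]
            for j in range(d_v):
                kv[i][j] += phi_k[i] * v[j]
    out = []
    for q in Q:
        phi_q = feature_map(q)
        norm = sum(a * b for a, b in zip(phi_q, k_sum))
        out.append([sum(phi_q[i] * kv[i][j] for i in range(d)) / norm
                    for j in range(d_v)])
    return out


# Hypothetical toy inputs: n = 3 positions, d = d_v = 2.
Q = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.8]]
K = [[0.2, 0.1], [-0.3, 0.4], [0.7, -0.6]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
```

Both functions return the same values up to floating-point rounding. The linear form does one pass over the keys and one over the queries, so its cost scales as n · d · d_v rather than n² · d.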

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, K.; Ning, W.; Zhu, Y.; Li, Z.; Wang, T.; Jiang, W.; Zeng, M.; Yang, Z. Gaussian-Linearized Transformer with Tranquilized Time-Series Decomposition Methods for Fault Diagnosis and Forecasting of Methane Gas Sensor Arrays. Appl. Sci. 2024, 14, 218. https://doi.org/10.3390/app14010218

