Long-Term Prediction Model for NOx Emission Based on LSTM–Transformer

Abstract: Excessive nitrogen oxide (NOx) emissions cause growing environmental problems and have led to increasingly stringent emission standards, which in turn require precise control of NOx emissions. A prerequisite for precise control is accurate NOx emission detection. However, the NOx measurement sensors currently in use suffer from serious measurement lag due to the harsh operating environment and other factors. To address this issue, long-term prediction of NOx emissions is needed. In this paper, we propose a long-term prediction model based on LSTM–Transformer. First, the model uses self-attention to capture long-term trends. Second, a long short-term memory network (LSTM) is used to capture short-term trends and to serve as a secondary position encoding that provides positional information. The two components are combined in a parallel structure. In long-term prediction, experimental results on two real datasets with different sampling intervals show that the proposed model performs better than currently popular methods, with 28.2% and 19.1% relative average improvements on the two datasets, respectively.


Introduction
Nitrogen oxide (NOx) is one of the major exhaust pollutants causing atmospheric pollution. Industrial sources dominate, accounting for 40.9% of total NOx emissions [1]. The large amount of NOx generated by rotary kilns during the sintering process is one of the major industrial sources. Excessive NOx emissions can endanger ecosystems and human health. Under pressure from environmental problems, NOx emission standards have become increasingly strict [2,3], which poses a major challenge for NOx emission control in rotary kilns.
For rotary kilns, there are currently two main methods for reducing NOx emissions from combustion [4]: low-NOx combustion technology and flue gas denitrification technology. Using the former alone does not allow NOx emissions to meet the requirements of the standard. Among the latter, selective catalytic reduction (SCR) is widely used for flue gas denitrification because it is economical and has very high NOx removal efficiency [5]. The principle of SCR technology is that a suitable dose of reductant (such as ammonia) is injected according to the NOx concentration at the reactor outlet and reacts with NOx to produce nitrogen gas and water. Insufficient ammonia injection fails to remove NOx effectively. Excessive ammonia injection causes ammonia pollution through ammonia leakage, and its by-products (such as NH4HSO4) can endanger equipment performance and safe operation. Continuous emission monitoring systems (CEMS) are now widely used in rotary kilns to obtain NOx emission concentrations. However, due to the performance of the flue gas analyzer and the length of the flue gas sampling pipe, there is a certain degree of lag error in the measured values. Furthermore, the harsh working environment can lead to aging or damage of the measurement components, which means that satisfactory measurement accuracy cannot be guaranteed [6,7].
Meanwhile, using expensive, high-precision measuring equipment is not practical given cost and actual production needs. Therefore, establishing an accurate NOx emission concentration prediction model is essential for achieving low NOx emissions.
Researchers have proposed a variety of methods for predicting NOx emissions. These methods can be mainly classified into mechanism-based and data-driven types. The mechanism-based method is computationally expensive and requires expert knowledge. For some complex systems, it is difficult to obtain their expressions, which makes NOx emission prediction very difficult. In contrast, data-driven methods do not require a priori knowledge and use data to model the mapping relationships between variables. Due to their ability to model complex nonlinear systems, support vector machines (SVM) and artificial neural networks (ANN) are widely used to predict NOx emissions. Some works [8,9] constructed NOx emission prediction models for coal-fired boilers based on SVM. Other studies [10][11][12] used ANN to construct NOx emission prediction models. Although SVM and ANN have good nonlinear modeling capabilities, the network structure and parameters of ANN are difficult to determine, and ANN suffers from overfitting. For SVM, a large number of input variables and large datasets make training computationally difficult and make optimal solutions hard to obtain. Moreover, both methods have difficulty capturing temporal features in time series.
Deep learning techniques have developed rapidly in recent years and are widely used for time series prediction tasks. The recurrent neural network (RNN) [13] is one of the most popular deep learning methods used for time series prediction. In the literature [14,15], virtual sensors for NOx emission prediction were constructed based on RNN. Although RNN can model short-term dependencies effectively, it has problems modeling long-term dependencies due to the vanishing and exploding gradient problem [16]. Hochreiter and Schmidhuber proposed the long short-term memory network (LSTM) [17], which addresses the problem of long-term dependencies to some extent. It can capture long-term dependencies through gating mechanisms with memory units while maintaining the ability to model short-term dependencies. It has become the main method currently used to predict NOx emissions. Tan et al. [18] constructed a single-step prediction model for NOx emissions from coal-fired boilers based on LSTM. Yang et al. [19] studied a NOx emission prediction method combining principal component analysis (PCA) and LSTM. He et al. [20] proposed and validated a NOx emission prediction model based on CNN-LSTM [21], where CNN is used to extract features from multi-dimensional data, and LSTM is used to identify relationships between different time steps. Xie et al. [7] used the maximal information coefficient (MIC) as a method for feature selection and designed a sequence-to-sequence (S2S) multi-step NOx emission prediction model with an attention mechanism (AM) based on LSTM. Wang et al. [22] constructed a hybrid model for NOx emission prediction based on complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) with AM and LSTM.
The network structure of S2S usually consists of two parts, the encoder and the decoder [23][24][25][26]. The encoder encodes the input data as a fixed-length vector, and the decoder decodes the vector to generate the desired output. This structure has been successful in natural language processing (NLP) because it can combine contextual information and produce variable-length output without being limited to the length of the input sequence. However, the problem with the S2S network architecture is that the performance of the encoder-decoder deteriorates rapidly as the length of the input sequence increases [25]. To solve this problem, researchers [27,28] proposed an S2S structure with an attention mechanism. This attention mechanism calculates the correlations between the hidden states of the encoder and decoder to highlight the most valuable information, thus improving the performance of the S2S structure for long-sequence inputs.
In practical applications, there is a certain delay between the input and target variables due to the large hysteresis and inertia of rotary kilns and the influence of measurement lag errors. Therefore, an accurate long-term prediction of NOx emissions is needed so that control actions can be taken in advance to overcome the impact of the delay. However, long-term prediction is hugely challenging, and it demands models that can effectively capture long-term dependencies. Although LSTM networks for NOx emission prediction are continuously improving, the structure of LSTM limits its ability to make long-term predictions. The literature [29] noted that as the prediction time range becomes longer, the inference speed of the LSTM network decreases rapidly, and the model starts to fail.
The Transformer architecture [30] has been widely used for NLP and has achieved state-of-the-art results in many tasks due to its demonstrated power in processing sequential data. Benefiting from the self-attention mechanism, it performs strongly in capturing long-term dependencies, which makes it a candidate for long-term prediction tasks. However, it may not be appropriate to use Transformer directly for time series prediction tasks, for the following reasons: (1) Transformer was mainly proposed for NLP, which is typically a classification task rather than a regression task; (2) the self-attention mechanism can disrupt the continuity of the time series, which leads to the loss of correlation; (3) the sin-cos method is used in the Transformer structure to encode the position of the sequential data, but this method does not provide enough position information [31].
Most of the current work focuses on short-term prediction, and research on NOx emission prediction using Transformer is currently limited. In this study, we propose a long-term prediction model for NOx emission based on LSTM-Transformer. The improvements are as follows: (1) The model uses self-attention to capture long-term dependencies and uses LSTM to capture short-term dependencies. Considering long-term and short-term patterns simultaneously enables the model to retain crucial fine-scale information and thus make accurate long-term predictions. (2) In view of the shortcomings of self-attention and sin-cos position encoding in the time series prediction task, LSTM is used to maintain the temporal continuity of the time series data and to learn the position information. (3) The parallel-designed structure can improve computational efficiency.
The rest of this paper is organized as follows. Section 2 describes the investigated rotary kiln. Section 3 details the proposed model. Section 4 describes the data preprocessing. Section 5 presents the detailed experimental results and discussion. Section 6 concludes the study.

Model Object Description
In this paper, an alumina rotary kiln is used as the research object to investigate the problem of predicting the NOx emission concentration at the outlet of the SCR reactor. The alumina rotary kiln components and process flow are shown in Figure 1.

Continuous Emission Monitoring System
Continuous emission monitoring systems (CEMS) are currently used to measure NOx emissions in rotary kilns. For NOx measurement, the extraction condensation method is usually used. The principle is to measure the NOx content in the flue gas using UV absorption, measure the wet oxygen content via electrochemical methods, and then calculate the dry flue gas concentration of NOx by wet-dry conversion. When measuring NOx emission concentrations, the gas is sampled by the sampling probe and then sent through the sampling line to the gas pollutant analyzer. The other operating parameters are measured as follows: the flue gas temperature using a temperature sensor, the flue gas pressure using a pressure sensor, the flue gas flow rate using the Pitot tube method, and the flue gas humidity using the capacitance/resistance method. All measurement signals are fed into the data acquisition and processing system.

Statement of Existing Problems
In engineering applications, it is difficult to place the CEMS analyzer near the sampling point, which results in a long heat-traced sampling line and, in turn, a long delay in sampling the flue gas. The delay of the measurement link prevents the ammonia injection from responding to changes in NOx concentration in time. In the SCR reactor, fly ash in the flue gas adheres to the catalyst, reducing the contact area between the catalyst and the flue gas and the efficiency of the denitration reaction. In air preheaters, the deposits absorb moisture from the flue gas, clogging and corroding equipment. This increases fan power consumption, affects the safe operation of the unit, and may even limit its load-carrying capacity. Reactors often show a decrease in denitrification efficiency after a period of operation. The main reasons for the deactivation of SCR denitrification catalysts include mechanical wear, clogging, sintering aging, and catalyst poisoning. In summary, the ammonia injection control target has a large delay due to the delay in the CEMS measurements and in the SCR chemical reaction process.

LSTM
The LSTM structure was proposed to solve the problem that RNN cannot handle long-term dependencies efficiently. LSTM mainly uses a gating mechanism to control information updates. Figure 2 illustrates the specific structure of the LSTM unit, which consists of three gate structures (the forget gate, the input gate, and the output gate) and uses a memory cell to store historical information. The forget gate removes unimportant information from the memory cell, the input gate controls the new information that will be added to the memory cell, and the output gate determines the output based on the cell state. The specific calculation process is expressed as follows:

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$c_t = f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t$$
$$h_t = o_t \otimes \tanh(c_t)$$

where $x_t$ is the input at time $t$; $c_t$ and $h_t$ are the cell state and hidden state at time $t$, respectively; $\tilde{c}_t$ is the candidate cell state; and $f_t$, $i_t$, and $o_t$ are the forget gate, the input gate, and the output gate, respectively. $W_{x\bullet}$ represents the weight matrix connected to the input layer, $W_{h\bullet}$ represents the weight matrix connected to the hidden layer, and $b_{\bullet}$ is the bias vector. $\sigma$ is the sigmoid function, and $\otimes$ represents element-wise multiplication.
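To make the gating mechanism concrete, the following is a minimal sketch of one time step of a standard LSTM cell, reduced to scalar inputs and states for readability. The function name `lstm_step`, the parameter dictionary, and the all-zero weights are illustrative assumptions, not the configuration used in this paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps names such as 'W_xf' to floats."""
    f_t = sigmoid(p["W_xf"] * x_t + p["W_hf"] * h_prev + p["b_f"])  # forget gate
    i_t = sigmoid(p["W_xi"] * x_t + p["W_hi"] * h_prev + p["b_i"])  # input gate
    o_t = sigmoid(p["W_xo"] * x_t + p["W_ho"] * h_prev + p["b_o"])  # output gate
    c_hat = math.tanh(p["W_xc"] * x_t + p["W_hc"] * h_prev + p["b_c"])  # candidate state
    c_t = f_t * c_prev + i_t * c_hat  # keep part of the old memory, add new information
    h_t = o_t * math.tanh(c_t)        # expose part of the cell state as output
    return h_t, c_t

# Hypothetical parameters: all zeros, so every gate evaluates to 0.5.
zero_params = {name: 0.0 for name in (
    "W_xf", "W_hf", "b_f", "W_xi", "W_hi", "b_i",
    "W_xo", "W_ho", "b_o", "W_xc", "W_hc", "b_c")}
h1, c1 = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
```

With all gates at 0.5 and a zero candidate state, the cell keeps half of its previous memory, which shows how the forget gate scales the stored history.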

Self-Attention Mechanism
The self-attention mechanism simulates the ability of a human to focus on processing information, which enables the machine to selectively allocate attention resources to more critical parts rather than the whole, thus improving the quality and efficiency of information acquisition and the performance of the model.
Transformer is based solely on the self-attention mechanism, without recurrence or convolutions. The attention used by Transformer is the standard dot-product attention, and the input consists of the query ($Q$), key ($K$), and value ($V$). Suppose there is a sequence of inputs $X \in \mathbb{R}^{l \times d_{model}}$; then $Q$, $K$, and $V$ can be obtained by linear transformation, as follows:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where $W^Q, W^K \in \mathbb{R}^{d_{model} \times d_k}$ and $W^V \in \mathbb{R}^{d_{model} \times d_v}$ are learnable projection matrices. In the standard dot-product attention, the dot product of $Q$ and all of $K$ is calculated and scaled by $\sqrt{d_k}$. The softmax function is used to obtain the weight of each value, which is multiplied by $V$ to select the attention assignment. It is defined as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Transformer extends self-attention to multi-head attention, which applies $N$ different learned linear projections to $Q$, $K$, and $V$, called $N$ attention heads. The different attention heads focus on different dimensions of information; they are computed in parallel, concatenated, and projected again to obtain the final output value, which can be defined as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_N)W^O$$
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

where $\mathrm{head}_i$ represents the self-attention distribution of head $i$. $W_i^Q$, $W_i^K$, and $W_i^V$ represent the linear projection parameter matrices of head $i$, which are calculated similarly to the self-attention mechanism, and $W^O$ represents the parameter matrix of the output projection.
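As a small worked illustration of scaled dot-product attention, the sketch below computes the attention weights and output for one query over two keys using plain Python lists. The function names and the toy values of `Q`, `K`, and `V` are assumptions chosen for readability; a practical implementation would use a tensor library.

```python
import math

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K: lists of d_k-dimensional rows; V: list of rows. Returns (output, weights)."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is the attention-weighted sum of the value rows.
    output = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
              for w_row in weights]
    return output, weights

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[2.0], [4.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Because the two keys are identical, the query attends to both equally and the output is the average of the two values.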

LSTM-Transformer
The NOx emission prediction model proposed in this paper builds on the traditional Transformer and modifies its structure, as shown in Figure 3. Inside the Transformer, LSTM is embedded in a parallel structure. These structural improvements have the following specific effects:

• Long-term dependencies are modeled using a self-attention mechanism, and short-term dependencies are modeled using LSTM, thus simultaneously focusing on the repetitive patterns of the time series data in the long and short term.
• The sin-cos position encoding method only considers the distance relationship but not the direction relationship, both of which are equally important for time series prediction tasks. Structurally, LSTM receives and transmits information sequentially in time order, so LSTM can be used to learn the distance and direction information of the input data.
• The LSTM encoding can maintain the continuity of time series data in time, thus reducing the loss of model accuracy caused by the attention mechanism disrupting the continuity of time series data.
• Parallel-designed structures can improve computational efficiency.

(a) Encoder
The encoder is composed of a stack of $N$ identical encoder layers. Each encoder layer contains three main layers. The first is an LSTM network, the second is a multi-head self-attention layer, and the third is a feedforward layer. Residual connections [32] and layer normalization [33] are used around these layers. The overall equation for the $l$-th encoder layer is summarized as $X^{l}_{\mathrm{en}} = \mathrm{Encoder}(X^{l-1}_{\mathrm{en}})$, where $X^{0}_{\mathrm{en}} = X_{\mathrm{en}}$ and $X_{\mathrm{en}} \in \mathbb{R}^{L \times d_{model}}$ represents the historical input sequence with input step length $L$. $X^{l}_{\mathrm{lstm,en}} \in \mathbb{R}^{L \times d_{model}}$ represents the output after LSTM encoding in the $l$-th encoder layer. $X^{l}_{\mathrm{mh,en}} \in \mathbb{R}^{L \times d_{model}}$ represents the output after the first multi-head attention layer in the $l$-th encoder layer. $X^{l}_{\mathrm{en}} \in \mathbb{R}^{L \times d_{model}}$ represents the output of the $l$-th encoder layer. $\mathrm{M_{head}}$ represents the multi-head self-attention mechanism.
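The way the two parallel branches might be merged can be sketched as follows. This is only one plausible reading of the prose: it assumes the residual input is added together with the attention and LSTM branch outputs before layer normalization, and it omits the learnable gain and bias of layer normalization. The function names and toy values are hypothetical.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one position's feature vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def fuse_parallel_branches(x, attn_out, lstm_out):
    """Add the residual input and both branch outputs per position, then normalize."""
    fused = []
    for x_row, a_row, l_row in zip(x, attn_out, lstm_out):
        summed = [xi + ai + li for xi, ai, li in zip(x_row, a_row, l_row)]
        fused.append(layer_norm(summed))
    return fused

# Toy example: one time step with four features per branch.
x = [[1.0, 2.0, 3.0, 4.0]]
attn_out = [[0.5, 0.5, 0.5, 0.5]]
lstm_out = [[0.0, 1.0, 0.0, 1.0]]
z = fuse_parallel_branches(x, attn_out, lstm_out)
```

Since neither branch consumes the other's output, the attention and LSTM computations can run independently before this merge step.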

(b) Decoder
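The decoder is also composed of a stack of $N$ identical decoder layers. The structure is similar to that of the encoder. The overall equation for the $l$-th decoder layer can be summarized as $X^{l}_{\mathrm{de}} = \mathrm{Decoder}(X^{l-1}_{\mathrm{de}}, X^{N}_{\mathrm{en}})$. The decoder can be formalized as follows:

$$X^{l}_{\mathrm{de}} = \mathrm{LayerNorm}\left(X^{l}_{\mathrm{mh,de}} + \mathrm{M_{head}}\left(X^{l}_{\mathrm{mh,de}}, X^{N}_{\mathrm{en}}\right)\right) \quad (15)$$

where $X^{0}_{\mathrm{de}} = X_{\mathrm{de}}$. $X^{l}_{\mathrm{mh,de}} \in \mathbb{R}^{L \times d_{model}}$ represents the output after the first multi-head attention layer in the $l$-th decoder layer. $X^{l}_{\mathrm{de}} \in \mathbb{R}^{L \times d_{model}}$ represents the output of the $l$-th decoder layer.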

(c) Output layer
After the decoder decodes the feature vector, it is passed through a fully connected feedforward layer, with trainable weight matrix $W$ and bias vector $b$, and then through a linear layer (Linear) to obtain the predicted output $Y_{pred} \in \mathbb{R}^{L \times d_y}$.

In summary, the input step size $L$ is determined first, and $d$ specific characteristic variables from the rotary kiln are selected as model inputs $X \in \mathbb{R}^{L \times d}$. The data dimension is transformed to the model dimension using a nonlinear mapping with ReLU as the activation function; after position encoding, we obtain the LSTM-Transformer's original inputs $X_{\mathrm{en}} \in \mathbb{R}^{L \times d_{model}}$. After entering the encoder layer, this input enters the multi-head attention layer and the LSTM layer simultaneously. The two branches do not interfere with each other and are computed in parallel. The output $X_{\mathrm{mh,en}} \in \mathbb{R}^{L \times d_{model}}$ is obtained from the multi-head attention layer and can be used to model long-term dependencies. The output $X_{\mathrm{lstm,en}} \in \mathbb{R}^{L \times d_{model}}$ is obtained by encoding in the LSTM layer; it can be used to model short-term dependencies, to learn the position information of the input, and to maintain the continuity of the data. After both are computed, they are added and layer-normalized to obtain the feedforward layer input. Compared to a serial structure, the parallel structure is designed to retain the high computational efficiency of the original Transformer while using the LSTM to add information that improves performance in time series prediction. The feedforward layer is used to increase the nonlinear capability of the model, and the residual connections and layer normalization are used to optimize the model. The decoder layer has a similar structure to the encoder layer. The position encoding is calculated as follows:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where $pos$ is the position index in the sequence and $i$ is the dimension index.
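The sin-cos position encoding of the original Transformer can be sketched in a few lines. The function name is an assumption, and the example sizes (36 positions, model dimension 128) are taken from the settings reported later in this paper.

```python
import math

def sinusoidal_position_encoding(length, d_model):
    """Return a length x d_model table: sine on even columns, cosine on odd columns."""
    pe = [[0.0] * d_model for _ in range(length)]
    for pos in range(length):
        for k in range(d_model):
            i = k // 2  # columns 2i and 2i+1 share the same frequency
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe

pe = sinusoidal_position_encoding(length=36, d_model=128)
```

The table is added to the embedded inputs so that each time step carries a distinct, fixed signature.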

Data Preprocessing

Datasets
The research data used in this paper came from Zhongzhou Aluminum Plant, Jiaozuo City, Henan Province, China. We obtained the data directly from the distributed control system (DCS) in the field. To cover most working conditions during the operation of the alumina rotary kiln and to improve the reliability of the experimental results, we used two datasets with different sampling intervals for the training and testing of the prediction model. One dataset contains 8460 samples with a sampling interval of 10 s, covering the 24 h operating history of the studied subject, which we named "Data 10 s". The other dataset contains 8460 samples with a sampling interval of 30 s, covering the three-day running history of the studied subject, which we named "Data 30 s". We followed standard protocol and split each dataset into training, validation, and test sets in the ratio of 6:2:2.
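A 6:2:2 split of 8460 samples can be sketched as below. The split is assumed here to be chronological (no shuffling), which is common for time series but is not stated explicitly in the text; the function name is hypothetical.

```python
def chronological_split(samples, ratios=(6, 2, 2)):
    """Split a time-ordered sequence into train/validation/test without shuffling."""
    total = sum(ratios)
    n = len(samples)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

samples = list(range(8460))  # stand-in for 8460 time-ordered records
train, val, test = chronological_split(samples)
```

For 8460 samples this yields 5076 training, 1692 validation, and 1692 test samples, with the test set covering the most recent period.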

Outlier Detection and Missing Values Handling
This paper uses the box plot method to detect outliers [34]. The greatest advantage of box plots is that they are not affected by outliers and can accurately and consistently describe the discrete distribution of data, while also facilitating data cleaning. We treat outliers as missing values because missing values can be filled in using information from existing variables.
We did not remove missing values directly, as this would lead to loss of information. Instead, we used the KNN imputer to fill them in. It finds the several most similar historical samples in the neighborhood of the missing value and uses them to fill it in.
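The box plot rule can be sketched as follows. The whisker multiplier `k=1.5` is the conventional choice and is an assumption, since the text does not state it; the function name and the sample readings are illustrative only. Flagged values are replaced by `None` so that a subsequent imputation step (KNN imputation in this paper, not implemented here) can fill them.

```python
import statistics

def mark_outliers_as_missing(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with None."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v if lower <= v <= upper else None for v in values]

readings = [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]  # hypothetical sensor readings
cleaned = mark_outliers_as_missing(readings)
```

Only the reading of 100.0 falls outside the whiskers and is marked as missing; the other values are kept unchanged.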

Feature Variable Selection
According to the production process of the alumina rotary kiln and the generation mechanism of NOx, combined with the advice of on-site experts, we selected thirteen input variables for the prediction model based on the measuring devices available at the industrial site: total air rate, twin-tube rotation speed, kiln head temperature, kiln tail temperature, burning zone temperature, blower rotation speed, smoke evacuator variable frequency current (two), oxygen content, incoming kiln slurry pressure, incoming kiln slurry flow rate, and past NOx emissions.

Data Standardization
This paper uses Z-score normalization for each input variable. The formula is as follows:

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the variable in the training set.
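A minimal sketch of Z-score normalization is given below. The statistics are estimated on the training set only and then reused for other splits, as the text specifies. Using the population standard deviation is an assumption, and the function names and numbers are illustrative.

```python
import statistics

def fit_standardizer(train_values):
    """Estimate mean and (population) standard deviation on the training set."""
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply the training-set statistics to any split."""
    return [(v - mu) / sigma for v in values]

train_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy data: mean 5, std 2
mu, sigma = fit_standardizer(train_values)
scaled_test = standardize([5.0, 9.0], mu, sigma)
```

Reusing the training statistics for the validation and test sets avoids leaking information from those sets into the model inputs.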

Evaluation Metrics
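In this paper, the root-mean-squared error (RMSE) and mean absolute percentage error (MAPE) are used as evaluation metrics for the model prediction quality. They are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of data samples.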
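The two metrics can be computed as in the sketch below; the function names and the two-sample example are illustrative, and MAPE is returned in percent.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(y_true, y_pred)) / len(y_true)

y_true = [100.0, 200.0]
y_pred = [110.0, 180.0]
rmse_value = rmse(y_true, y_pred)
mape_value = mape(y_true, y_pred)
```

In this example both predictions are off by 10% of the actual value, so MAPE is 10%, while RMSE weights the larger absolute error of 20 more heavily.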

Baselines
In this paper, seven prediction models are selected for comparison: Transformer, CEEMDAN-AM-LSTM, S2S-AM-LSTM, CNN-LSTM, LSTM, BPNN, and SVM. For the Transformer baseline, the output layer network structure is changed because the task is a regression problem rather than a classification problem. S2S-AM-LSTM is defined as an LSTM-based encoder-decoder network with an attention mechanism.

Implementation Details
Our proposed model contains three encoder layers and three decoder layers. During training, the mean square error (MSE) is used as the loss function. Adam [35] is used as the optimizer, with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-8}$. The learning rate is 0.001, the dropout is 0.1, the number of attention heads is 1, and the model dimension is 128. For Data 10 s, the batch size is set to 256, and for Data 30 s, the batch size is set to 128. For the baseline models, the hyper-parameters were optimized by manual adjustment or grid search to ensure the validity of the experimental results. An early stopping strategy was used during training.
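An early stopping rule of the kind mentioned above can be sketched as follows. The text does not give the stopping criterion, so the patience-based rule, the `patience` value, and the validation losses below are all assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 0-based epoch at which training would stop, or None if never triggered."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value occurs at epoch 1.
stop_epoch = early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.95], patience=3)
never_stops = early_stopping_epoch([1.0, 0.9, 0.8], patience=3)
```

In practice the model weights from the best validation epoch would be restored before evaluation on the test set.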

Results and Analysis
In this section, we evaluate the effectiveness of LSTM-Transformer for NOx emission concentration prediction on two datasets with different sampling intervals. The results are presented in tables and figures.

NOx Concentration Emission Prediction
To compare the performance of the models in long-term prediction over different future time horizons, we set the input length $I = 36$ and the prediction length $O \in \{6, 12, 24, 48\}$. The results are reported in Tables 1 and 2, with the best result shown in bold.
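The input and prediction lengths define how training pairs are cut from the series. The following sketch builds such pairs with a sliding window; the stride of one step and the function name are assumptions, and a toy integer series stands in for the real multivariate data.

```python
def make_windows(series, input_len=36, horizon=6):
    """Cut (input, target) pairs from a time-ordered series using a stride of one."""
    pairs = []
    for start in range(len(series) - input_len - horizon + 1):
        x = series[start:start + input_len]                        # I past steps
        y = series[start + input_len:start + input_len + horizon]  # O future steps
        pairs.append((x, y))
    return pairs

series = list(range(100))  # stand-in for a NOx time series
pairs = make_windows(series, input_len=36, horizon=6)
```

With a sampling interval of 10 s, a horizon of 48 steps corresponds to predicting 8 min ahead; with a 30 s interval it corresponds to 24 min.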
Based on the results in Tables 1 and 2, the following can be observed:
(1) In the long-term prediction task, LSTM-Transformer significantly improves the prediction performance on both datasets with different sampling intervals. This demonstrates the success of the proposed model in enhancing long-term time series prediction capability.
(2) LSTM-Transformer has better prediction accuracy than Transformer. The reason is that LSTM can provide fine-grained short-term trend information as well as position information. This demonstrates the effectiveness of the structure we designed.
(3) An increase in the sampling interval may miss some changes in the data during the longer interval, which leads to the loss of information. This is the main reason for the degradation of model performance. Notably, LSTM-Transformer still has better prediction accuracy as the sampling interval increases. This means that LSTM-Transformer is more robust, which is meaningful for the accurate long-term prediction of NOx emission concentration.
(4) The Transformer-based models have better prediction accuracy. This demonstrates the advantage of self-attention in capturing long-term dependencies, as self-attention makes the signal path as short as possible.
(5) CEEMDAN-AM-LSTM performs better than the other LSTM-based models and shows a prediction accuracy similar to that of Transformer. This demonstrates the effectiveness of the CEEMDAN method in time series preprocessing. We speculate that combining CEEMDAN with the Transformer might give good results.
(6) We also find that the prediction accuracy of LSTM-Transformer, Transformer, CEEMDAN-AM-LSTM, and S2S-AM-LSTM gradually deteriorates as the prediction distance increases. This is due to the limitations of the encoder-decoder architecture, which suffers from error accumulation when performing dynamic decoding inference.
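The error accumulation in dynamic decoding can be illustrated with a toy rollout in which each prediction is fed back as the next input. The one-step model below (persistence with a constant bias) is purely hypothetical and is not one of the models evaluated in this paper.

```python
def rollout(one_step_model, history, horizon):
    """Predict `horizon` steps by repeatedly feeding predictions back as inputs."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        y_hat = one_step_model(window)
        preds.append(y_hat)
        window = window[1:] + [y_hat]  # the prediction replaces the oldest input
    return preds

def biased_persistence(window, bias=0.5):
    # Hypothetical one-step model: repeats the last value with a constant error.
    return window[-1] + bias

preds = rollout(biased_persistence, history=[10.0, 10.0, 10.0], horizon=4)
```

If the true signal stays at 10.0, the per-step error of 0.5 grows to 2.0 after four steps, because each step inherits the error of the previous one.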

Analysis of Generalization Capacity
In practical applications, the model needs to adapt to the different working conditions of the rotary kiln during operation. For Data 30 s, the test set covers 14.4 h. To adequately verify the generalization performance of LSTM-Transformer, we additionally selected data from different time periods to test the model. The test set sampling interval is still 30 s, and the coverage time is increased to 18 h, 24 h, and 36 h.
The results of the generalization performance tests are shown in Table 3. The results show that the prediction performance of LSTM-Transformer remains stable as the test set coverage time increases. This means that the model adapts well to different working conditions during operation. It also adapts well to small external perturbations, such as external operations by site staff. This indicates that the model has strong generalization ability.

Conclusions
This paper studies the problem of the long-term prediction of NOx emission concentration, which is a pressing demand due to environmental issues. The paper takes an alumina rotary kiln as the research object and proposes LSTM-Transformer for long-term NOx emission prediction. Specifically, a model structure is designed that focuses on both long-term and short-term trends, which can efficiently capture historical trend information for long-term prediction. In the comparison of long-term prediction performance, LSTM-Transformer has better results than the baseline models. Compared to the baseline models, LSTM-Transformer yielded 28.2% and 19.1% average RMSE reductions on the two datasets with different sampling intervals, respectively.
However, LSTM-Transformer may have some limitations in predicting longer distances. First, because the complexity of standard dot-product self-attention grows quadratically with sequence length, longer prediction lengths can cause out-of-memory failures and are limited by computational and memory resources. Second, traditional dynamic decoding inference methods suffer from error accumulation in long-sequence prediction and consume a lot of inference time. Therefore, in future work, we will focus on sparse self-attention to reduce complexity and memory usage while trying to improve the inference structure in order to make the model suitable for long-sequence prediction tasks. Ultimately, we hope to combine the prediction model with an intelligent control system to build an intelligent rotary kiln flue gas denitrification system.

Table 1. Prediction results of different models on Data 10 s.

Table 2. Prediction results of different models on Data 30 s.

Table 3. Prediction results of LSTM-Transformer with different test set coverage lengths.