Feature Selection and Uncertainty Analysis for Bubbling Fluidized Bed Oxy-Fuel Combustion Data

This paper presents a novel feature extraction and validation technique for data-driven prediction of oxy-fuel combustion emissions in a bubbling fluidized bed experimental facility. The experimental data were analyzed and preprocessed to minimize the size of the data set while preserving patterns and variance, and to find an optimal configuration of the feature vector. The Boruta Feature Selection Algorithm (BFSA) determines the feature vector's configuration, and the Multiscale False Neighbours Analysis (MSFNA) is newly extended and proposed to validate the BFSA's design for emission prediction, assuring minimal uncertainty in the mapping between feature vectors and corresponding outputs. The main finding is that standalone BFSA does not reflect different sampling period setups, which proved to significantly influence the false neighborhood in the design of feature vectors for emission prediction; MSFNA resolves this.


Introduction
In general, oxy-fuel combustion is one of the 'carbon capture' technologies that aim to reduce CO2 emissions, mainly from coal power plants [1], but also within other fuel combustion systems such as biomass [2], which remains prospective for utilization and research. The fuel is oxidized in almost pure oxygen instead of air, as in traditional air combustion, and the flue gas therefore consists primarily of carbon dioxide and water vapor [3,4].
Combustion processes are complex and nonlinear, and their measurements generate complex data that require proper analysis for the further design of prediction and control systems.
Computational intelligence approaches with machine learning have attracted research interest in combustion processes over the last decades. For example, in [5], a multilayer feedforward neural network with error back-propagation learning was used to approximate the measured CO/lambda dependence of biomass combustion; the data show a significant variance that can be seen as a kind of uncertainty, and the neural network extracts the prevailing dependence in the data. A further example of filtration with membership function design can be found in [6]. Computational intelligence techniques based on immune systems, applied to biomass combustion and highlighting the nonlinearity and complexity of the process, can be found in [7]. Prediction of NOx emissions from a biomass-fired combustion process using an experimentally established dataset of flame radical images and deep learning was presented in [8], where Morphological Component Analysis and region-of-interest extraction were also used to reduce the dataset size and improve the clarity and unambiguity of the samples.
Digital flame images were likewise used as inputs in [9], where successful operating condition recognition was achieved by building a combined Principal Component Analysis and Random Weight Network model (PCA-RWN) and estimating the optimal model parameters using cross-validation.
One of the traditional approaches to COx emission reduction is presented in [10], where emissions caused by temporary fluctuations in the fuel feed of wood pellets were compensated by precise control of the primary and secondary airflow (air staging).
Apart from emission prediction or process control itself, feature selection techniques represent a significant part of research on similar topics. A feature selection technique that evaluates features by comparing the MAE (Mean Absolute Error) between real and predicted values for different subsets of primary and secondary variables from the original dataset can be found in [11]. The selected features were used to build a model of a brown-coal-fired boiler to predict fresh steam properties for a suitable combination of input parameters. An analogous approach was also used for short-term forecasting of CO2 emission intensity [12], where a forward selection algorithm was applied after removing highly correlated variables based on Pearson's correlation coefficient and LASSO regression. Apart from Pearson's correlation coefficient, a statistic popular especially for large data is the MIC (Maximal Information Coefficient [13]), which can capture both functional and nonfunctional relationships and gives similar scores to equally noisy relationships. MIC was used in [14] for NOx emission prediction with Long Short-Term Memory (LSTM) networks.
With prospects for future research and the design of a machine-learning-based system for predicting emissions from the oxy-fuel facility, we propose a technique that designs a feature vector using random forests, combined with further validation via uncertainty analysis by the Multiscale False Neighbour Analysis (MSFNA) [15]. In this study, the feature vector is proposed by the Boruta Feature Selection Algorithm (BFSA) [16], which combines two powerful tools for feature selection: decision trees and the z-score.
Mathematical notation of the variables is the following: small letters such as "x" denote scalars, bold "x" vectors, and bold capital "X" matrices. Lower indexes such as "x_i" usually indicate the position of the element in the vector x (except for u and y); the index i may also denote the ith row of a matrix. The letter u stands for a process (control) input variable, y denotes a process output variable, and u_i and y_j denote the ith process input variable and the jth process output variable, respectively (see Table 1). Further in the text, "x" denotes the feature vector; note that the terms "feature vector", "state vector", and "input vector" have the same meaning and are used interchangeably in this paper. Capital "N" denotes the length of the data (the count of all processed data samples), and "n" is generally used for other quantities, such as "n_x" for the length of the vector x or "n_c" for the number of applied (principal) components. Exclusively, r_x and r_y denote radii vectors in the state space of the state vectors x and of the output variable y, respectively. Finally, u_i[k − n_ui : k − 1] denotes the vector of the last measured step delays of variable u_i (the vector length is n_ui).

Feature Selection
Feature selection is a fundamental process for classification and prediction. With the decreasing price of data acquisition systems, the number of measured variables is increasing, and so is the need for feature selection and data reduction. Feature selection methods are traditionally divided into three groups as follows [17,18]:

1.
Filter methods use various techniques (correlation, Mutual Information, chi-squared test, etc.) to score the variables and compare the scores against a threshold to filter out features.

2.
Wrapper methods apply different machine learning algorithms to obtain predictor performance as an evaluation of the variables. The main disadvantages are the high chance of overfitting and longer computation time.

3.
Embedded methods aim to reduce computation time by integrating feature selection into the training process. Typical representatives are LASSO and Elastic Net.
In studies on similar topics regarding emission prediction, there are three common approaches:
• using Principal Component Analysis to reduce the size of an input data set [9,19],
• searching for statistical dependencies between features [20,21],
• filtering out outliers and irrelevant data points [11,22,23].
The current state of the art in feature selection algorithms is focused on static mapping functions [24]. In this paper, we apply the BFSA to reduce the unnecessary length of the state vector of a discrete-time dynamic system with constant sampling, so it also identifies the optimal configuration of step-delayed variables. Furthermore, we propose to verify the feature selection performed by the BFSA via the MSFNA, which proves useful for sampling period validation. The use of BFSA and MSFNA is recalled in the following subsections. BFSA is a wrapper feature selection method presented in [16]. The algorithm searches for all important features in the existing data set, using a random forest algorithm to identify the importance of the features.
The main idea behind the BFSA is as follows: if a feature scores lower than the best of the shadow features, the feature is not relevant for the system (an unsuccessful performance).
First, BFSA creates shadow features from the existing ones by randomly shuffling the original values of the feature matrix X along each column, where x is a state (feature, input) vector composed of actual input variables, their step-delayed values, and step-delayed output variables, whose configuration is to be found by BFSA and validated by MSFNA. A random forest classifier is used as a predictor that builds a model extended by the shadow features, i.e., by additional, irrelevant ones. The random forest model is thereupon used for feature performance evaluation. For each feature, the z-score is computed as the average accuracy loss of the prediction divided by its standard deviation [16]. Feature importance is estimated by comparing its z-score against the maximal z-score value of the shadow features. This value is used as the threshold because the shadow feature importance is nonzero only due to random fluctuations and is not beneficial for the system at all. The process is repeated for a predefined number of trials, and the numbers of successful and unsuccessful performances of each feature are counted.
The total number of successful or unsuccessful performances that each feature must reach is driven by the binomial distribution P(s) = C(n, s) · p^s · (1 − p)^(n−s), where n is the number of trials, p ∈ [0, 1] is the success probability for each trial, and s is the total number of successful trials; this defines confidence intervals as in the example in Figure 1. The number of successful performances a feature must reach to be labeled as important for the data is given only by the number of trials used. Successful features then create a new data set. Using 20 trials, a feature can be labeled as important if it scores higher than the shadow features at least 16 times.
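The shadow-feature idea can be sketched in a few lines. Note this is a minimal stand-in, not the full BFSA: plain random forest impurity importances replace the z-scores, and the data, seeds, and the 20-trial/16-hit bound follow the example above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def boruta_trial(X, y, seed):
    """One BFSA-style trial: append shadow features (column-wise shuffles of X)
    and mark a real feature as 'successful' if its random forest importance
    exceeds the best shadow importance."""
    rng = np.random.default_rng(seed)
    shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    rf = RandomForestRegressor(n_estimators=100, random_state=seed)
    rf.fit(np.hstack([X, shadows]), y)
    imp = rf.feature_importances_
    n = X.shape[1]
    return imp[:n] > imp[n:].max()

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))                 # x0, x1 informative; x2 pure noise
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=400)

# Count successful performances over 20 trials; with 20 trials a feature
# is labeled important if it succeeds at least 16 times (binomial bound).
hits = sum(boruta_trial(X, y, s) for s in range(20))
important = hits >= 16
```

The noise column occasionally beats the best shadow by chance in a single trial, which is exactly why the repeated trials and the binomial bound are needed.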

Multiscale False Neighbours Analysis (MSFNA)
The MSFNA is a technique for evaluating the uncertainty of the state vector's configuration, as it evaluates the determinism of the general input-output mapping f (x) = y, i.e., the mapping f () between feature vectors x and corresponding outputs y [15]. Unlike common false neighbour analysis, which requires precise knowledge of the neighborhood radius, see [25,26], MSFNA uses a vector of several radii to scale the neighborhood condition. If x is the state vector of the system, then the system output y is unambiguously determined by the input (state vector) x. If the same input x results in different outputs y, the state vector is not complete and other features must be included; such states are then referred to as false neighbours (FN). Two states x_1, x_2 are FN if they are close in the state space, ‖x_1 − x_2‖ ≤ r_x, while the corresponding outputs are distant, |y_1 − y_2| > r_y, where r_x, r_y are absolute radii that define the condition of state similarity. Because the optimal r_x and r_y are practically unknown, MSFNA utilizes a whole range of radii via heuristically designed vectors r_x = [r_x1, r_x2, . . . , r_xα], r_y = [r_y1, r_y2, . . . , r_yβ], assuming that the optimal radii lie within the ranges of r_x, r_y. The Areal Cumulative False Neighbours (ACFN) is then computed as the sum of all false neighbours in the dataset over all combinations of the radii, which evaluates the uniqueness of the state vector configuration for each data point. The higher the ACFN, the less determinism, i.e., the more false neighbours, there is in the mapping between the designed feature vector configuration and the prediction outputs.
To compare datasets with different numbers of samples, we introduce the Relative ACFN (RACFN), which eliminates the dependence of the ACFN on the number of samples in the dataset by normalizing via the total count of all neighbour pairs. Apart from the number of samples, the dimension of the input matrix (state vector) is also a source of deceptive information, arising from comparing Euclidean distances of variously long state vectors against the same radii vectors r_x and r_y.
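The ACFN and RACFN can be sketched numerically as follows; this is a minimal sketch assuming Euclidean distances, an ad hoc radii grid, and one plausible variant of the pair-count normalization (the exact normalization in the paper may differ):

```python
import numpy as np

def acfn(X, y, r_x, r_y):
    """Areal Cumulative False Neighbours: over all combinations of radii from
    r_x and r_y, count state pairs that are close in state space
    (||x1 - x2|| <= rx) yet distant in output space (|y1 - y2| > ry)."""
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dy = np.abs(y[:, None] - y[None, :])
    iu = np.triu_indices(len(y), k=1)          # each pair counted once
    dX, dy = dX[iu], dy[iu]
    return sum(int(np.sum((dX <= rx) & (dy > ry))) for rx in r_x for ry in r_y)

def racfn(X, y, r_x, r_y):
    """Relative ACFN: normalized by the total number of pair/radii
    combinations so datasets of different lengths become comparable."""
    n_pairs = len(y) * (len(y) - 1) // 2
    return acfn(X, y, r_x, r_y) / (n_pairs * len(r_x) * len(r_y))

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
y_det = X[:, 0] + 0.5 * X[:, 1]                # deterministic mapping f(x) = y
y_rnd = rng.uniform(size=200)                  # output unrelated to x
radii = np.linspace(0.05, 0.5, 10)
```

A deterministic mapping yields a lower RACFN than an output unrelated to the state vector, which is the comparison the method exploits.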
While (6) resolves the issue of varying data lengths, it still suffers from the varying feature vector lengths n_x that BFSA can, in principle, find for different datasets. Thus, we propose to resolve this non-unique feature-vector-length problem by compressing (column-wise) the matrix of state vectors X as in (1) using Principal Component Analysis (PCA) into a customized number of components n_c, i.e., the matrix X is compressed to a constant number of columns (new features), and the MSFNA is then evaluated for state vectors of the same length n_c. The PCA compression into n_c features uses v_1, v_2, . . . , v_nc, the selected eigenvectors of the covariance matrix C_X = XX^T. It can be noted that the PCA compression prior to the RACFN not only standardizes its results by unifying the feature vector length for various feature setups, but also naturally increases the robustness of the MSFNA results, as the PCA compression further removes unique features because only the n_c most significant components are used.
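The compression step can be sketched as follows; a minimal variant that centers the data before the eigendecomposition (the feature counts 65 and 24 are illustrative only):

```python
import numpy as np

def pca_compress(X, n_c):
    """Project the state matrix column-wise onto its n_c leading principal
    components, so feature vectors of different lengths n_x map to the same
    length n_c before the RACFN comparison."""
    Xc = X - X.mean(axis=0)                    # center each feature column
    C = np.cov(Xc, rowvar=False)               # feature covariance matrix
    w, V = np.linalg.eigh(C)                   # eigenvalues ascending
    lead = np.argsort(w)[::-1][:n_c]           # indices of the n_c largest
    return Xc @ V[:, lead]

rng = np.random.default_rng(2)
X_a = rng.normal(size=(150, 65))               # e.g. a heuristic 65-feature setup
X_b = rng.normal(size=(150, 24))               # e.g. a shorter BFSA-selected setup
n_c = 5
Z_a, Z_b = pca_compress(X_a, n_c), pca_compress(X_b, n_c)
```

Both compressed matrices now have n_c columns, so their RACFN values are evaluated against the same radii vectors on equal footing.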

Data and Preprocessing
The experimental device is a 30 kW laboratory-scale oxy-fuel BFB combustor; its scheme is shown in Figure 2. Sunflower pellets were used as the biomass fuel.
The variables used for the analysis are listed in Table 1. Input variables u_i denote features that can be changed or indirectly influenced by the facility set-up. Output variables y_i consist of the volumetric fractions of the flue gas components.
The experiment was realized by measuring four states of the oxy-fuel mode, defined by the ratios of the secondary and primary oxygen flows. In each mode, between 900 and 1100 samples were measured, as shown in Table 2. All oxy-fuel modes were measured under the same environmental conditions in one day. The collected data were first checked for missing values, and irrelevant data points were removed and replaced with the previous value. The data were resampled with four different sampling periods to compare the results: ∆T_1 = 5 s, ∆T_2 = 10 s, ∆T_3 = 15 s, and ∆T_4 = 25 s. Each new point is calculated as the mean value of the previous points, which acts as a simple smoothing filter. In the next step, the data were normalized using the Min-Max scaler as shown in (8) and (9).
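The resampling and scaling steps can be sketched as follows; the raw 1 s series and block sizes are illustrative only, and the block mean realizes the simple smoothing filter described above:

```python
import numpy as np

def block_resample(x, dt_old, dt_new):
    """Resample a series to a longer sampling period by averaging consecutive
    blocks of samples (simple smoothing); an incomplete tail block is dropped."""
    m = int(round(dt_new / dt_old))            # raw samples per new sample
    n = (len(x) // m) * m
    return x[:n].reshape(-1, m).mean(axis=1)

def minmax_scale(x, fr_min=0.0, fr_max=1.0):
    """Min-Max scaling onto the desired feature range [fr_min, fr_max]."""
    return fr_min + (x - x.min()) * (fr_max - fr_min) / (x.max() - x.min())

x_raw = np.arange(20.0)                        # stand-in 1 s measurement series
x_5s = block_resample(x_raw, dt_old=1.0, dt_new=5.0)   # 4 block means
x_scaled = minmax_scale(x_5s)                  # scaled onto [0, 1]
```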
FR_max and FR_min are the upper and lower interval limits of the desired feature range; we used [0, 1]. The last step is to design the state vector and the matrix of states. The matrix of all state vectors X stacks the state vectors x row-wise, where the full X before applying BFSA for selecting the optimal features contains, for each sample k, the input variables with their step delays u_i[k − n_ui : k − 1] and the step-delayed output variables y_j[k − n_yj : k − 1]. The maximum numbers of step delays n_ui and n_yj are initialized heuristically; the meaning of the input and output process variables is defined in Table 1. A snapshot of the measured and min-max scaled process variables u_i and y_j is shown in Figure 3.
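The construction of X from step-delayed variables can be sketched as follows; the variable names `u5`, `y1` and the delay counts are illustrative only:

```python
import numpy as np

def delay_matrix(series, delays):
    """Build the matrix of state vectors X: each row k stacks the step-delayed
    values v[k - n_v : k - 1] of every variable v (inputs u_i and outputs y_j),
    aligned so that all columns share the same time index k."""
    n_max = max(delays.values())
    cols, names = [], []
    for name, n_d in delays.items():
        s = np.asarray(series[name])
        for d in range(1, n_d + 1):
            cols.append(s[n_max - d:len(s) - d])   # value at lag d for each row
            names.append(f"{name}[k-{d}]")
    return np.column_stack(cols), names

rng = np.random.default_rng(3)
series = {"u5": rng.normal(size=50), "y1": rng.normal(size=50)}
X, names = delay_matrix(series, {"u5": 3, "y1": 2})   # 3 + 2 delayed columns
```

Row 0 of X corresponds to k = n_max, so every row has the full set of delays available; this matrix is the input that BFSA then prunes.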
For further details of the real device and the experimental setup, please see [3,4,27].

Algorithm Complexity
The computational complexity of the BFSA is O(n_x · N), where n_x is the number of features and N is the number of samples [16]. The computational complexity of the Multiscale False Neighbour Analysis is O(n_r · N), where n_r = α is the length of the MSFNA absolute radii vector; see the text below Equation (4).

Experimental Analysis and Results
In this section, the experimental analysis, first on artificial data and then on real data, is shown for the algorithms described in Section 2.

Artificial Data
The first goal was to test the proposed algorithms on an artificial dataset of a simple MISO (multiple-input, single-output) system (12), where k = 1, 2, . . . , N = 250.
Two additional variables X[:,3] and X[:,4] were added to test the BFSA; these columns are combinations of a sine and a cosine curve, respectively, with white noise. The artificial data from (12) are plotted in Figure 4, where X[:,3] and X[:,4] are added to demonstrate that BFSA eliminates them and that MSFNA detects more false neighbours via a higher RACFN.

Experimental Data
For each output variable, a state vector x was designed. The maximum number of history samples of each feature was set to 15, as in Equation (11). The results for sampling periods ∆T = [5, 10, 15, 25] s for each output variable are shown in Tables 3-6; one can notice that the variables u_1 = ψ_fg, u_2 = t_run, and u_3 = t_d were not selected by BFSA as input features for any output variable y_i. The primary air flow u_4 = V_air,prim was used rather sparsely. The total number of state vector features is listed in the last row of Tables 3-6. (Tables 3-6 list the state vectors x_i proposed by the Boruta Feature Selection Algorithm for outputs y_i at sampling periods ∆T_s = 5, 10, 15, and 25 s; for ∆T_s = 5 s, u_8 is additionally not chosen for y_5, and for ∆T_s = 15 s, u_4 is not chosen for y_3.)

The Multiscale False Neighbours Analysis was performed with the same configuration for all samplings ∆T = [5, 10, 15, 25] s. The step delays n_ui and n_yj were set to [5, 10, 15], resulting in 3 state vector configurations with 3 different numbers of features, n_x = [65, 130, 195]. The fourth examined state vector is the one proposed by BFSA. The results are shown in Figures 6-9. The RACFN was chosen as the comparative quantity due to its ability to compare the data variance of the examined state vectors and to express the results graphically. In Figures 6-9, orange columns are the state vectors proposed by BFSA, blue columns are the state vectors created according to (11) with n_ui = n_yj = [5, 10, 15], and the x axis represents the number of features n_x for the specified y_i.

Discussion
The major outcome of the feature selection by BFSA and its newly proposed validation by MSFNA is demonstrated in Figures 6-9, where the different sampling periods (∆T = [5, 10, 15, 25] s) result in various levels of uncertainty (RACFN) between the by-BFSA-designed feature vectors and the process outputs. In Figures 6 and 7, where the sampling is faster (than in Figures 8 and 9), the BFSA proposes feature vector configurations that display more uncertainty via the RACFN than the other, merely heuristic, manual feature selections. For the longer sampling periods (Figures 8 and 9), the by-BFSA-designed feature vectors already display the lowest uncertainty via the RACFN as well. Recall that the RACFN for all the various feature vector configurations was evaluated after PCA compression of the feature vectors to the equal length n_c, which naturally increases the robustness of the method, as already discussed in Section 2.1.
The primary airflow V_air,prim, recirculation flow V_rec, and secondary air velocity v_air,sec were selected by BFSA less often and differently for each sampling frequency, which indicates their lower importance for emission prediction.
The BFSA results, further validated by MSFNA, support the assumption that the flue gas composition strongly depends on the previous values of the process output variables y_i. This corresponds to the high complexity of the combustion dynamics, where the measured values provide only a minimum of the necessary information about the actual process, so emission prediction remains a challenge.
With a specific sampling frequency, a shape similarity can be observed across all output variables. This results from the strong sampling-frequency dependence of dynamic systems.
The minimum RACFN was reached for the state vector designed by BFSA with the longer sampling periods ∆T = 15 s and ∆T = 25 s. This finding corresponds to the assumption of slower dynamical behavior of the important phenomena in the real process for which we intend to predict the emissions (in follow-up research, as it would exceed the scope of this paper). It can be observed that the uncertainty of the by-BFSA-designed features improved for the longer sampling of ∆T = 15 s (Figure 8) and did not improve further for even longer samplings (Figure 8 vs. Figure 9). Thus, the combination of BFSA and MSFNA not only selects features but also suggests the proper shortest sampling (for which the RACFN becomes the smallest among the non-BFSA feature designs).

Conclusions
The feature selection algorithm (BFSA) was studied for the mapping between the available measurements (features) and emissions (outputs) of bubbling fluidized bed oxy-fuel combustion data. The original uncertainty analysis (MSFNA) was newly extended (with PCA compression and the RACFN) and proposed to validate the BFSA in terms of uncertainty via false neighbours. The proposed technique is promising for the automatic configuration of feature vectors from measured data (for future prediction systems with machine learning). Furthermore, the combination of the conventional BFSA with MSFNA proved capable of clearly validating and proposing a proper sampling period, which is otherwise computationally difficult with standalone BFSA.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available.