Combustion Instability Monitoring through Deep-Learning-Based Classification of Sequential High-Speed Flame Images

Abstract: In this study, novel deep learning models based on high-speed flame images are proposed to diagnose the combustion instability of a gas turbine. Two different network layers that can be combined with any existing backbone network are established: (1) an early-fusion layer that can learn to extract the power spectral density of subsequent image frames, which is time-invariant under certain conditions, and (2) a late-fusion layer that combines the outputs of a backbone network at different time steps to predict the current combustion state. The performance of the proposed models is validated on a dataset of high-speed flame images obtained in a gas turbine combustor during transient processes from stable to unstable conditions and vice versa. Excellent performance is achieved for all test cases, with high accuracies of 95.1–98.6% and short processing times of 5.2–12.2 ms. Interestingly, simply increasing the number of input images is as competitive as attaching the proposed early-fusion layer to a backbone network. In addition, using handcrafted weights for the late-fusion layer is shown to be more effective than using learned weights. From the results, the best combination is selected as the ResNet-18 model combined with our proposed fusion layers over 16 time steps. The proposed deep learning method is proven to be a potential tool for combustion instability identification and is expected to be a promising tool for combustion instability prediction as well.


Introduction
Combustion instability (CI) causes large oscillations of dynamic pressure and heat release, which can damage hot gas path components such as the combustor itself and the turbine blades and vanes located downstream. Thus, CI has been one of the main issues for many combustion engines, including gas turbines and internal combustion engines. Although tremendous efforts have been exerted to identify and characterize CI mechanisms [1][2][3][4][5] and to monitor and control CI [6][7][8][9][10][11], the relentless tightening of NOx emission regulations has forced gas turbines to operate under ultra-lean premixed conditions. Premixed combustion near the blowout limit makes safe operation of gas turbines difficult because it causes the flame to oscillate and thus amplifies CI [12,13]. However, with the advent of advanced machine learning techniques in the era of the fourth industrial revolution, a new paradigm of CI monitoring is emerging.
In particular, with the advances in optical sensing and digital image processing, digital imaging devices have been employed in on-line combustion-state monitoring systems [14][15][16][17][18][19][20][21][22][23]. For example, Lu et al. extracted from flame images a variety of handcrafted features such as the ignition point and area, and the flickering of the flame [14]; their work revealed how these features vary with furnace load. Sun et al. defined a simple universal index that assesses the stability of a flame in terms of color, geometry, and luminance [17]. The feasibility of the oscillation frequency as an indicator of flame stability was also studied. Chen et al. proposed a unique processing algorithm based on multiway principal component analysis (MPCA) and a hidden Markov model (HMM) for monitoring the combustion state in an industrial furnace [15,16]. From the MPCA-HMM analysis, spatial and temporal features of dynamic flame images were extracted, and the feasibility of their model was verified by successfully detecting fault situations in a time-series control chart. Bai et al. segmented a flame image into premixed and diffused regions, in which color and texture features were extracted [19]. They applied principal component analysis (PCA) to the features and then used a kernel support vector machine to identify the operation condition.
The advances in digital image processing have been accelerated by machine learning methods, which can learn important features automatically. Thus, deep-learning-based methods have recently been used for combustion-state monitoring and identification of combustion instability onset [18,[20][21][22][23]]. Wang et al. proposed a convolutional neural network (CNN) to monitor furnace combustion states and predict the heat release rate [20]. The CNN showed higher accuracy than traditional methods based on handcrafted features and learned classifiers. Liu et al. proposed a deep belief network (DBN) model that can directly construct the complex nonlinear relationship between flame images and the outlet oxygen content [21]. Their model was compared with the conventional linear feature extraction method of PCA, and its superiority was verified through on-site tests in a real combustion system, demonstrating not only the possibility of nonlinear feature extraction but also higher prediction accuracy, with a relative error of 5.543%. They used a supervised learning framework to regress the oxygen content of combustion systems, which requires a large number of labeled images.
These labels are often given by manual labor. To reduce the laborious task, Akintayo et al. proposed a semi-supervised learning approach in which only stable flame images are assumed to be labeled [18]. A convolutional selective autoencoder was used to learn flame features, and the statistics of stable flame features were used to detect unstable flames. To avoid manual labeling, Qiu et al. used a convolutional autoencoder and PCA for feature extraction and dimension reduction [22]. An HMM was then used to model the transition probability between states and the conditional probability of features given a state. Han et al. proposed an unsupervised learning-based approach. In their approach, flame images are transformed into features by a stacked sparse autoencoder [23]. The training labels are then automatically determined by clustering the features and applying statistical analysis to handcrafted features [14]. These approaches are useful if the dataset consists only of images [22,23] or the labels of the images are partially known [18]; however, the approaches are prone to error if the unsupervised or semi-supervised features are not well separated across different combustion states.
In this paper, we propose a deep supervised learning-based method for combustion state prediction. An abundance of training data is obtained without manual labor by synchronized high-speed flame imaging and dynamic pressure measurement. A single model is trained to deal with multi-mode flame image sequences whose scale is unprecedented. Unlike previous network architectures [18,[20][21][22][23]] that take single images as input, our proposed method takes subsequent image frames and predicts the combustion state of the last frame. We propose two different network layers that can be combined with any backbone network to improve the performance. Our proposed early-fusion layer can learn to extract the power spectral density of subsequent image frames, which is time-invariant under certain conditions. Instead of modeling the complicated temporal behavior of flames using a transition probability across neighboring time steps [15,22], this study proposes a late-fusion layer that can combine long-range outputs effectively to maximize the classification accuracy. The best combination of the proposed layers with the adopted backbone network was determined through a validation test whose dataset was obtained from gas turbine combustion experiments under six cases, to ensure the robustness of the prediction for various kinds of combustion instability.

ResNet-Based Models for Sequential Flame Images
Our proposed method takes subsequent flame image frames as input and produces the combustion state of the last frame. ResNet [24], the champion architecture of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015, serves as a backbone network. The proposed model architecture is composed of two additional sequential layers. The early-fusion layer provides the ResNet with input features extracted from the image frames so that the ResNet will extract deeper features and predict the combustion state better. The late-fusion layer combines the ResNet outputs at different time steps based on the assumption that previous outputs can help in improving the current output. Figure 1 shows the proposed model architecture including the early-fusion layer, the ResNet, and the late-fusion layer. The input to the early-fusion layer is a sequence of T subsequent image frames, and the output is a feature map with 4T channels. In this paper, we vary T from 1 to 16. Based on the feature map, the ResNet provides the probabilities of three different combustion states: stable, transient, and unstable. The probabilities at the current time step τ, as well as those at previous time steps, are used as input to the late-fusion layer to obtain the final state probabilities. Figure 1 depicts a model with two previous time steps τ − 1 and τ − 2. The combustion state at τ is then determined as the one maximizing the probability from the late-fusion layer. The following subsections describe the model in more detail. In Figure 1, the colored circles represent probabilities corresponding to the three combustion states: stable (blue), transient (green), and unstable (red).

Convolutional Neural Network
Before introducing the details of the proposed model, an understanding of the convolutional neural network and ResNet is necessary since they are used for the model configuration. Convolution has been widely used to extract image features such as edges [25], corners [26], and affine covariant regions [27]. Since an image is represented by a 2D grid of pixels, 2D convolution is performed on an image by applying a linear filter to all pixels. The filter size typically ranges from 1 × 1 pixels to 7 × 7 pixels. For image I with C channels, the 2D convolution operation is defined as follows:

    H_F(i, j) = Σ_{c=0}^{C−1} Σ_{u=−U}^{U} Σ_{v=−V}^{V} W(c, u, v) I(c, i + u, j + v) + b,    (1)

where (i, j) is the pixel coordinate vector and c is the channel index. I(c, i + u, j + v) is the intensity of I at (c, i + u, j + v). The filter size is (2U + 1) × (2V + 1) pixels, W is the filter weight, and b is a bias term. Finally, H_F(i, j) is the filter response, or the feature, at pixel location (i, j). In the traditional feature extraction methods [25][26][27], handcrafted filter weights have been used along with a zero bias term b = 0. In deep learning, both W and b are learned from data. Different filters can be applied to obtain a feature map with multiple channels. For example, we can apply K filters to an image to obtain feature map H_F with K channels. In this case, Equation (1) can be rewritten as follows:

    H_F(k, i, j) = Σ_{c=0}^{C−1} Σ_{u=−U}^{U} Σ_{v=−V}^{V} W(k, c, u, v) I(c, i + u, j + v) + b(k),    (2)

where k is the filter index ranging from 0 to K − 1. Considering a feature map X as an image, we can apply 2D convolution to X to obtain another feature map Y.
To reduce redundant features, the filters can be applied to pixels sampled with strides greater than 1. Alternatively, the strongest features in small neighborhoods may be pooled, which also reduces the feature map size; this operation is called max-pooling. With the reduced memory burden, the network can grow deeper.
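As a concrete illustration, Equation (1) can be evaluated directly. The following is a minimal NumPy sketch (computing only the "valid" region where the filter fully overlaps the image), not the paper's implementation:

```python
import numpy as np

def conv2d(I, W, b=0.0):
    """Direct evaluation of the multi-channel 2D convolution of Equation (1).
    I: image of shape (C, H, W); W: filter of shape (C, 2U+1, 2V+1)."""
    C, H, Wd = I.shape
    _, kh, kw = W.shape
    out = np.zeros((H - kh + 1, Wd - kw + 1))  # responses where the filter fits
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # filter response: weighted sum over channels and the local window, plus bias
            out[i, j] = np.sum(W * I[:, i:i + kh, j:j + kw]) + b
    return out

# Example: a 3 x 3 averaging filter on a single-channel 5 x 5 image
image = np.arange(25, dtype=float).reshape(1, 5, 5)
feat = conv2d(image, np.full((1, 3, 3), 1.0 / 9.0))
```

In deep learning frameworks the same operation is performed by optimized convolution layers; the loop form above only makes the indexing of Equation (1) explicit.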
Training a deep neural network changes the parameters of all layers. With a large learning rate, the distribution of each layer's input greatly changes, complicating the training. Thus, the learning rate needs to be small, leading to a long training time. With batch normalization [28], the input distribution is stabilized and the learning rate can be increased to reduce the training time. For a mini-batch with M training samples, batch normalization whitens each channel H_F(k) using the mean E[H_F(k)] and variance Var[H_F(k)] over the mini-batch, as follows:

    H_N(k) = (H_F(k) − E[H_F(k)]) / √(Var[H_F(k)] + ε),    (3)

where ε is a small positive number preventing the denominator from becoming zero. By the whitening, the mean and variance of H_N(k) become 0 and 1, respectively, resulting in an unwanted loss of representation power. To recover the representation power, the following linear transformation, with learnable parameters γ and β, is applied:

    H_B(k) = γ(k) H_N(k) + β(k).    (4)
When using batch normalization, the bias term b or b(k) is often omitted because β replaces the role of the bias term.
Applying linear filters repeatedly to an image is the same as applying their composite linear filter. Thus, such a purely linear deep convolutional network can hardly learn nonlinearity in data. To inject nonlinearity, the following rectified linear unit (ReLU) function is applied to the linear features:

    ReLU(x) = max(0, x).    (5)
The ReLU function suffers less from the vanishing gradient problem [29], so it is helpful for increasing the depth of neural networks. In recent neural network architectures, convolution blocks are typically defined by a series of convolution, batch normalization, and ReLU, as we reviewed in this subsection.
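The convolution block reviewed in this subsection (convolution, batch normalization, then ReLU) can be sketched in PyTorch, which the authors use for their implementation. The channel counts and input size below are arbitrary examples, not the paper's:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    """A typical convolution block: convolution -> batch normalization -> ReLU.
    The convolution bias is omitted because batch normalization's beta replaces it."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),  # whitening plus learnable gamma/beta
        nn.ReLU(inplace=True),   # injects nonlinearity
    )

x = torch.randn(2, 4, 8, 8)       # mini-batch of 2 four-channel 8x8 feature maps
y = conv_block(4, 16)(x)          # 16 filters; padding keeps the spatial size
```

With stride=1 and padding=1, the 3 × 3 convolution preserves the spatial size, and the ReLU guarantees non-negative features.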

ResNet
The depth or the number of sequential layers of convolutional neural networks is of crucial importance in increasing the performance. However, adding more layers to a suitably deep model leads to higher training error. He et al. showed that this degradation problem is not caused by overfitting; they proposed convolution blocks with shortcut connections to reduce the degradation problem [24].
Let us denote a feature map by X and an underlying mapping to be fit by a few stacked layers by H(X). In residual learning, the residual mapping F(X) = H(X) − X is learned, and H is recovered by

    Y = F(X) + X,    (6)

where Y is the output feature map of the mapping. If all the weights of F(X) are zero, then H(X) becomes an identity mapping such that Y = X.
The identity mapping is implemented by using a shortcut connection, which directly connects the input to the output as shown in Figure 2. He et al. showed that it is easier to optimize a deep convolutional neural network based on building blocks with shortcut connections [24]. In this paper, ResNet-18 and ResNet-50 are adopted as backbone networks [24]; ResNet-50 also has an affordable training time for limited computational resources. Figure 3 shows the ResNet-18 and ResNet-50 architectures, where the number to the right of each block is the number of filters m. For both architectures, the input image has a size of 224 × 224 pixels. A 7 × 7 convolution with a stride of 2 is applied to the input image, resulting in a feature map with a size of 112 × 112 pixels. The feature map is then max-pooled to a size of 56 × 56 pixels. To the max-pooled feature map, a series of convolution blocks is applied. "/2" denotes that the feature map height and width are halved by convolution with a stride of 2 or by max-pooling. In every block with "/2", only the first 3 × 3 convolution has a stride of 2; in this case, the shortcut connection is replaced with a 1 × 1 convolution with a stride of 2 to match the reduced size and increased dimension. The final feature map size is 7 × 7 pixels, with 512 and 2048 channels for ResNet-18 and ResNet-50, respectively. The final feature map is average-pooled to obtain a 512- or 2048-dimensional vector; that is, each 7 × 7 channel is reduced to 1 × 1 by computing the average of the channel. For the ILSVRC, a 1000-dimensional fully connected layer is connected to the feature vector to output one of 1000 object classes. In this paper, a 3-dimensional fully connected layer (fc 3 in Figure 3) is used to output one of the three combustion states.
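A basic residual building block in the style of ResNet-18 can be sketched in PyTorch as follows. This is an illustrative sketch, not the authors' code; it shows that when all convolution weights of F are zero, the block reduces to Y = ReLU(X), i.e., the shortcut carries the input through unchanged:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block computing Y = ReLU(F(X) + X), with F two 3x3 conv-BN stages
    and a shortcut connection adding the input X directly to F(X)."""
    def __init__(self, ch):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch),
        )
        self.relu = nn.ReLU()
    def forward(self, x):
        return self.relu(self.f(x) + x)  # shortcut connection: add the input back
```

If the convolution weights of `f` are zero, then F(X) = 0 and the block outputs ReLU(X), which is the identity-mapping behavior described above.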
To model the probability of each state, the softmax function is applied to the output of the fully connected layer. Denoting the set of states by S = {stable, transient, unstable}, the softmax output is computed as

    y(s) = exp(o(s)) / Σ_{s'∈S} exp(o(s')),    (7)

where o(s) is the output of the fully connected layer for state s. The softmax output can be denoted as vector y, whose components are y(s) for s ∈ S. Finally, the state is determined by finding the s that maximizes y(s).
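The softmax computation and the final state decision can be sketched in a few lines of NumPy (with the standard max-shift for numerical stability; the example scores o(s) are made up for illustration):

```python
import numpy as np

def softmax(o):
    """Softmax over raw fully-connected outputs o(s)."""
    e = np.exp(o - o.max())  # subtracting the max avoids overflow, leaves the result unchanged
    return e / e.sum()

states = ["stable", "transient", "unstable"]
y = softmax(np.array([2.0, 0.5, -1.0]))    # hypothetical fc outputs o(s)
state = states[int(np.argmax(y))]          # "stable": softmax preserves the argmax
```

Because the softmax is monotone in o(s), the chosen state is simply the one with the largest raw score; the softmax only converts the scores into probabilities.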

Early Fusion
The simplest way to use subsequent image frames as input to a ResNet is to concatenate the image frames into a single multichannel image J. Denoting the image at time step t by I^(t), we concatenate T images for t = τ − T + 1, . . . , τ as

    J^(τ)(c, i, j) = I^(τ−T+1+c)(i, j),    (8)

for c = 0, . . . , T − 1. Because our flame images are of a single channel, each time step corresponds to a channel in J^(τ). In this paper, we vary T from 1 to 16. When T = 1, the output is the same as using a ResNet for a single image I^(τ); this is our baseline method. When T > 1, our task becomes to infer the state at τ using the images from I^(τ−T+1) to I^(τ).
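This channel-wise concatenation can be sketched as follows (NumPy; `frames` is a hypothetical list of single-channel images indexed by time step):

```python
import numpy as np

def concat_frames(frames, tau, T):
    """Stack the T single-channel frames ending at time step tau into a
    T-channel image J, with channel c holding frame tau - T + 1 + c."""
    return np.stack([frames[tau - T + 1 + c] for c in range(T)], axis=0)

# Toy sequence: frame t is a 2x2 image filled with the value t
frames = [np.full((2, 2), float(t)) for t in range(10)]
J = concat_frames(frames, tau=9, T=4)  # channels hold frames 6, 7, 8, 9
```

The last channel of J is always the current frame I^(τ), so the network sees the most recent image together with its short history.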
When combustion instability occurs, flame images rapidly change with time. It has been discovered that the change is periodic due to the periodic evolution and decay of heat release, which is driven by the periodic oscillation of the velocity or equivalence ratio of the mixture and is also affected by the periodic fluctuation of dynamic pressure inside the combustion chamber [5,10]. Based on the periodicity, one may want to transform the time-domain representation into a frequency-domain representation. For example, we can apply the discrete Fourier transform [30] at each pixel location as follows:

    Re(k, i, j) = Σ_{c=0}^{T−1} J^(τ)(c, i, j) cos(2πkc/T),    (9)
    Im(k, i, j) = −Σ_{c=0}^{T−1} J^(τ)(c, i, j) sin(2πkc/T),    (10)

where Re and Im are the real and imaginary components of the frequency-domain signal.
In Equations (9) and (10), each k corresponds to the frequency k f /T, where f is the image sampling frequency (8 kHz in our work). We can compute the power spectral density (PS) for the kth frequency as

    PS(k, i, j) = Re(k, i, j)² + Im(k, i, j)².    (11)

The power spectral density is time-invariant as long as the period of the image signal coincides with T [30]. Thus, training of a ResNet can be more stable with such an invariant input.
Because the period depends on various factors such as fuel composition or heat input, it may differ from sequence to sequence and from state to state. Besides, applying different T to different sequences will reduce the generality of the proposed method. In this paper we fix T for all sequences and learn the transform from data.
The computations in Equations (9) and (10) are both of the form

    H(k, i, j) = Σ_{c=0}^{T−1} W(k, c) J^(τ)(c, i, j),    (12)

which is a 1 × 1 convolution with T filters on J^(τ). Let us assume that the exact period T_exact is less than T. If the learned weight W(k, c) is cos(2πkc/T_exact) for c = 0, . . . , T_exact − 1 and 0 otherwise, then Equation (12) is equivalent to the real part of the discrete Fourier transform with a period of T_exact. We can analogously show the equivalence to the imaginary part, concluding that the 1 × 1 convolution is capable of learning the discrete Fourier transform with the exact period. Although T_exact can vary from sequence to sequence, the 1 × 1 convolution has a chance of learning the best T_exact, or the best weights that will result in the highest classification accuracy in an average sense.
To implement the block capable of extracting the power spectral density, we first apply two 1 × 1 convolutions with T filters to J (τ) . Denoting the two resulting feature maps by Re and Im, the power spectral density map PS is calculated as in Equation (11). J, Re, Im, and PS are then concatenated into a 4T-channel feature map, which is used as input to a ResNet. This ResNet is referred to as PS-ResNet in the subsequent sections. Figure 4 depicts the PS-ResNet. The two feature maps obtained by 1 × 1 convolutions are squared and added together to produce a feature map PS that can represent the per-pixel power spectral density. The input image sequence J (τ) and the three feature maps Re, Im and PS are then concatenated to produce the input to a ResNet. In the figure, the "S" circle and the "C" circle represent the square operation and the concatenation operation, respectively. We note that neither batch normalization nor ReLU is applied after the convolutions. Please refer to the text for more detail.
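The early-fusion front end described above can be sketched in PyTorch as follows. This is an illustrative sketch under the stated design (two learnable 1 × 1 convolutions with no batch normalization or ReLU, squared and summed to form PS, then concatenated with J), not the authors' released code:

```python
import torch
import torch.nn as nn

class PSFrontEnd(nn.Module):
    """Early-fusion layer: 1x1 convolutions produce Re and Im, PS = Re^2 + Im^2,
    and [J, Re, Im, PS] is concatenated into a 4T-channel input for a ResNet."""
    def __init__(self, T):
        super().__init__()
        # Note: no batch normalization or ReLU after these convolutions,
        # so they can learn the linear DFT-like transform of Equation (12).
        self.re = nn.Conv2d(T, T, kernel_size=1, bias=False)
        self.im = nn.Conv2d(T, T, kernel_size=1, bias=False)
    def forward(self, J):
        re, im = self.re(J), self.im(J)
        ps = re ** 2 + im ** 2                    # per-pixel power spectral density
        return torch.cat([J, re, im, ps], dim=1)  # 4T channels

J = torch.randn(1, 16, 224, 224)   # T = 16 stacked frames at the paper's input size
out = PSFrontEnd(16)(J)            # 4T = 64 channels
```

A ResNet whose first convolution accepts 4T input channels then consumes `out` directly.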
On the other hand, the first layer of a ResNet is a 7 × 7 convolution layer. Because 7 × 7 convolution can represent 1 × 1 convolution, the 7 × 7 convolution layer is capable of learning to extract Re and Im. Indeed, a ResNet and any other CNNs are capable of learning to extract frequency-domain features. The main difference of a PS-ResNet from a ResNet is the explicit computation of the power spectral density. If the power spectral density is not a vital feature, the performance of the two architectures may be similar.

Late Fusion
Another way to use subsequent image frames is to provide each image frame as input to a ResNet and combine the outputs to determine the state of the last frame. For t = τ − T + 1, . . . , τ, a ResNet returns T softmax outputs y^(t). The separate output vectors are then concatenated into a single 3T-dimensional vector ŷ. By employing a fully connected layer, we can combine the separate pieces of information as follows:

    ỹ^(τ)(s) = Σ_{n=0}^{3T−1} w(s, n) ŷ(n) + d(s),    (13)

where w(s, n) and d(s) denote the weights and biases of the fully connected layer. By finding the s that maximizes ỹ^(τ)(s), the state at τ is determined. Figure 5 illustrates our late fusion approach.

There are two different strategies to determine w(s, n) and d(s). The first is to learn the parameters from data. The second is to set the parameters manually. If we want the contributions from all outputs to be the same, then the weights can be determined as follows:

    w(s, n) = 1/T if n mod 3 = s, and 0 otherwise, with d(s) = 0,    (14)

considering that the stable, transient, and unstable states correspond to labels 0, 1, and 2, respectively. For example, with T = 2, w(s, n) can be represented in matrix form as

    w = 1/2 · [ 1 0 0 1 0 0
                0 1 0 0 1 0
                0 0 1 0 0 1 ],    (15)

where the rows and columns correspond to s and n, respectively. In Section 3, we test both strategies.

Recurrent neural networks (RNNs) have been used to analyze a variety of time-series data [28]. Training of RNNs relies on backpropagation through time (BPTT), in which the required computing resources increase with T. In addition, the vanishing and exploding gradient problems prevent RNNs from using a long T. There exist RNNs [31,32] that suffer less from the gradient problems by gating the input and hidden states. However, the gating technique does not mitigate the required computing resources and training time.
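The handcrafted equal-contribution weights described above simply average the T softmax outputs per state. A minimal NumPy sketch (the example softmax outputs are made up for illustration):

```python
import numpy as np

def handcrafted_weights(T):
    """Equal-contribution late-fusion weights: w(s, n) = 1/T if n mod 3 = s, else 0.
    Applying w to the concatenated 3T-vector averages the T softmax outputs per state."""
    w = np.zeros((3, 3 * T))
    for s in range(3):
        for t in range(T):
            w[s, 3 * t + s] = 1.0 / T
    return w

# Two softmax outputs y^(tau-1) and y^(tau), concatenated into y_hat (T = 2)
y_hat = np.concatenate([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]])
y_tilde = handcrafted_weights(2) @ y_hat  # averaged state probabilities, bias d(s) = 0
```

Each row of the weight matrix picks out the same state across all T time steps, so the fused score for a state is the mean of its per-frame probabilities.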
Our approach is similar to the teacher-forcing RNN [28], in which not the previous hidden states but the previous outputs affect the current hidden state. By using the target outputs or the outputs of the current model, BPTT can be avoided. In our approach, ỹ^(τ) is connected to y^(t) instead of ỹ^(t). This prevents a single false prediction of ỹ^(t) from affecting all subsequent predictions.
Our early fusion and late fusion approaches can be combined to produce more accurate predictions. In Figure 5, we can replace the ResNet and I^(t) with the PS-ResNet and J^(t), respectively. In this manner, the prediction at time step τ effectively depends on 2T − 1 images. On the other hand, the two time periods T in the early and late fusion approaches do not have to be the same. To distinguish T in the late fusion from that in the early fusion, we write T as T_l or T_e, using subscripts for late or early fusion, from now on.

Model Validation Test
This section provides experimental results for the validation of the proposed models. We first introduce our gas turbine combustor and measurement system and then our multi-mode dataset consisting of image sequences acquired under different combustion conditions.

Model Gas Turbine Combustor and Measurements
A model gas turbine combustor, which forms partially premixed flames to burn hydrogen and methane syngas without flashback risk, is used for this study. Figure 6 provides a schematic of the experimental setup for high-speed flame imaging and dynamic pressure measurement. Fuel gas is supplied from the central swirling nozzle and injected into the air along the inner side of the swirl vanes at 2.7 mm upstream of the dump plane. The cylindrical combustor liner is composed of a 200 mm quartz part and a 1440 mm steel part. The flame was visualized through the quartz part of the liner, whose diameter (D) is 130 mm and length (L) is 200 mm; the liner was cooled by compressed air to allow safe optical visualization of the flame. A water-cooled plug nozzle, which blocks approximately 90% of the combustor cross-sectional area, is installed at the end of the combustor to form acoustically closed boundary conditions. The dynamic pressure, which is used to monitor the level of combustion instability, is measured using piezoelectric dynamic pressure sensors (PCB-102A05, PCB Piezotronics, NY, USA) at the dump plane, as depicted in Figure 6. High-speed OH chemiluminescence flame images are obtained through WG-305 and UG-11 bandpass filters by a high-speed intensified CMOS camera (High-speed star8, LaVision, Göttingen, Germany). The recording is synchronized with the dynamic pressure signal to observe the flame structure according to the dynamic pressure oscillations. The images are taken for 1 s at 8 kHz with a resolution of 784 × 784 pixels and then resized to 224 × 224 pixels. More detailed descriptions of the experimental setup are provided in our previous articles [33,34].

High-Speed Flame Image Sequence Dataset
For the machine learning of this study, flame images and dynamic pressure data were acquired during transitions from an unstable to a stable state or vice versa. Table 1 shows our experimental conditions with 6 cases. Initial and final fuel supply conditions were predetermined from the results of our previous studies [33,34], which reported instability characteristics only at steady-state conditions. In Case 1, stability changes from an unstable to a stable state at the constant heat input of 50 kWth by changing the fuel composition from 100% CH4 to 12.5% H2 and 87.5% CH4. In the other cases, inverse conditions from stable to unstable were imposed by changing either fuel composition or heat input. One of the three phases of thermoacoustic stability (i.e., one of the stable, transient, and unstable combustion states) was assigned to each flame image based on the dynamic pressure (DP) signal. Because the amplitude of dynamic pressure changes with operating conditions (e.g., fuel composition, equivalence ratio, flow rate), we used adaptive thresholds for different cases. To set the thresholds, we used a stabilized DP signal computed as

    DP_RMS(i) = √( (1/n) Σ_{j=i−n+1}^{i} DP(j)² ),    (16)

where DP(i) is the dynamic pressure at time step i and n = 160 is the window length corresponding to 10 ms. For each case, we first manually selected stable sections based on our prior knowledge and the overall structure of the DP signal. The maximum DP_RMS was then found from the stable sections and used as the threshold separating stable states from transient states. The threshold separating transient states from unstable states was determined by introducing the concept of the cut-off frequency. For each case, the difference between the maximum DP_RMS and the minimum DP_RMS was computed. Denoting the minimum as DP_RMS,min and the difference as ∆DP_RMS, the threshold was set to DP_RMS,min + ∆DP_RMS/√2.
This threshold is about 3 to 4% of the static pressure of the combustion chamber, which is almost the same as the standard used in gas turbines, 3% to 5%. Figure 7 shows sample images from the 6 different cases. To apply early stopping [35] and to evaluate the generalization capability, each image sequence is divided into 80% of training, 10% of validation, and 10% of test sets. As depicted in the figures, the three sets are mutually exclusive.
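The labeling procedure above can be sketched as follows. This is a NumPy sketch under stated assumptions: `stable_max` is taken from the manually selected stable sections, and the 16 kHz DP sampling rate implied by n = 160 samples per 10 ms is our inference, not stated explicitly in the text:

```python
import numpy as np

def dp_rms(dp, n=160):
    """Sliding RMS of the dynamic pressure over windows of n samples
    (n = 160 corresponds to 10 ms at the implied 16 kHz DP sampling rate)."""
    out = np.empty(len(dp) - n + 1)
    for i in range(len(out)):
        out[i] = np.sqrt(np.mean(dp[i:i + n] ** 2))
    return out

def thresholds(rms, stable_max):
    """Adaptive thresholds: stable/transient at the max stable-section RMS,
    transient/unstable at DP_RMS,min + Delta_DP_RMS / sqrt(2)."""
    delta = rms.max() - rms.min()
    return stable_max, rms.min() + delta / np.sqrt(2)

r = dp_rms(np.full(200, 2.0))          # constant-amplitude toy signal -> RMS of 2
th = thresholds(np.array([1.0, 2.0, 3.0]), stable_max=1.5)
```

Each image is then labeled stable, transient, or unstable by comparing its synchronized DP_RMS value against the two thresholds.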

Training
For every network model, we minimize the cross entropy between the model's output y and its corresponding target output z. For example, if the target state is stable, then z = (1, 0, 0). The cross entropy loss is defined as follows:

    L = −Σ_{s∈S} z(s) log y(s).    (17)

The loss is minimized by the Adam optimizer [36] with a learning rate of 0.001, momentum coefficients (β1, β2) = (0.9, 0.999), and a mini-batch size of 32. The Adam optimizer is a variant of stochastic gradient descent with adaptive learning rates based on first and second moment estimates. At each iteration, the loss in Equation (17) is averaged over the mini-batch and the average loss is minimized.
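A single training iteration with this setup can be sketched in PyTorch. The tiny stand-in model below replaces a ResNet purely for illustration; note that `nn.CrossEntropyLoss` applies the softmax internally and averages Equation (17) over the mini-batch, so the model outputs the raw scores o(s):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a (PS-)ResNet backbone: flatten + 3-way linear classifier
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 3))

# Adam with the paper's hyperparameters
opt = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
loss_fn = nn.CrossEntropyLoss()  # softmax + cross entropy, averaged over the mini-batch

x = torch.randn(32, 3, 8, 8)     # toy mini-batch of 32 inputs
z = torch.randint(0, 3, (32,))   # target states: 0 stable, 1 transient, 2 unstable

opt.zero_grad()
loss = loss_fn(model(x), z)      # average cross entropy over the batch
loss.backward()                  # backpropagate gradients
opt.step()                       # Adam parameter update
```

In the actual experiments this loop runs over the full training set for 30 epochs while the validation accuracy is monitored.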
Each network model was trained over the entire training sets for 30 epochs while monitoring the validation accuracy. We repeated the training procedure three times with different random seeds and stored the best model parameters with the highest validation accuracy.
When training the late fusion layer, we used the pre-trained parameters of the ResNets and PS-ResNets; only the parameters of the late fusion layer were trained, following the same protocol described above. In this case, the cross entropy between the late-fused output ỹ and the target output z was minimized. The initial weights of the late fusion layer were chosen as the manual setting described in Section 2.4. Thus, the training tries to find a local maximum of the validation accuracy near the initial weights.

Tables 2 and 3 show the test set classification accuracy of the ResNets with and without the proposed early fusion. Although some outliers exist, the accuracy over the entire test set ("All" in Tables 2 and 3) tends to increase with T_e, showing the effectiveness of early fusion. The PS-ResNet gives higher overall accuracies for the majority of the combinations of architectures and T_e. However, the difference is less than 1% for most pairs of accuracies; thus, the power spectral density does not seem to be vital for our dataset. For Cases 1, 2, 4, 5, and 6, all the models give high accuracies greater than 96.0%; when T_e = 16, the lowest accuracy among these cases is 98.8%. However, the models do not give as high accuracy for Case 3.

Effect of Early Fusion
To analyze the relative degradation, we computed the approximate fundamental period of each image sequence by computing the average root mean square error (RMSE) between I^(t−∆t) and I^(t), as shown in Figure 8. Figure 9 shows subsequent image frames randomly sampled from each sequence, supporting the approximate fundamental periods. Case 3 clearly shows a period of ∆t = 5 in Figure 8; thus, the lack of periodicity of the sequence as a whole does not seem to be the reason for the degradation. A greater setting of T_e may deliver better results at the expense of increased computational resources and time; however, T_e = 16 is already greater than all the approximate fundamental periods, and according to our analysis in Section 2.3, a setting of T_e greater than all the fundamental periods is sufficient. For further analysis, we computed a confusion matrix for Case 3. Table 4 shows the confusion matrix corresponding to the worst accuracy, attained by the ResNet-50 with T_e = 2. The confusion matrix shows that the main error source is confusion between stable and transient states; in particular, transient states were more often mistaken for stable states. The test accuracy of the first transient section in Figure 7c is 9.2%, while that of the second transient section is 100%. Not only is the dynamic pressure of the first transient section only slightly higher than that of the stable sections, but the images of the first transient section are also highly similar to those of the second stable section, as shown in Figure 10. Consequently, classification across these two sections is more difficult than across the remaining ones. Additionally, the stable and transient sections in Figure 10 do not show strong periodicity, which explains the lower performance of Case 3 with large T_e's in Tables 2 and 3.
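The period analysis can be sketched as follows (NumPy; a hypothetical synthetic sequence with period 5 stands in for the flame images). The average RMSE between frames ∆t apart dips near the fundamental period:

```python
import numpy as np

def period_rmse(frames, max_dt=16):
    """Average RMSE between I^(t - dt) and I^(t) over the sequence, for each dt.
    Local minima of the returned list suggest the approximate fundamental period."""
    f = np.asarray(frames, dtype=float)
    return [float(np.sqrt(np.mean((f[dt:] - f[:-dt]) ** 2)))
            for dt in range(1, max_dt + 1)]

# Synthetic sequence: frame t is a 4x4 image oscillating with period 5
frames = [np.full((4, 4), np.sin(2 * np.pi * t / 5.0)) for t in range(40)]
r = period_rmse(frames, max_dt=8)  # r[4] (dt = 5) is near zero
```

For a perfectly periodic sequence the RMSE vanishes at the period; for real flame sequences it only reaches a local minimum, which is why the text speaks of an approximate fundamental period.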

Effect of Late Fusion
Our late fusion method can be applied to both ResNet and PS-ResNet outputs. In addition, the weights of the proposed late fusion layer can be either learned from data or handcrafted, as discussed in Section 2.4. Tables 5 and 6 show classification accuracies over all 6 cases according to different combinations of the proposed late fusion methods and backbone architectures. In Appendix A, Tables A1-A6 show detailed classification accuracies for each case. For all tests, we used a fixed setting of T_l = 16. Compared to the baseline method with single-image inputs (the ResNet-18 with T_e = 1), the proposed method (the PS-ResNet with T_e = 16 plus handcrafted late fusion) gives approximately 3.5% improvement in classification accuracy. It is interesting to note that the late-fusion method with handcrafted weights is more effective for most combinations than that with learned weights. Because the initial weights of the learned late-fusion layer are set to the handcrafted ones, the learned weights may be worse if the initial weights are already locally optimal. In addition, improving the weights based on the validation accuracy can lead to overfitting to the validation data. We checked the validation accuracies and found that those of the learned late fusion are similar to those of the handcrafted late fusion. This means that the handcrafted weights tend to be very close to local optima already.
However, there exists a combination (the ResNet-18 with T_e = 2) for which the learned late fusion improves the accuracy the most. Figure 11 shows the learned weights of the late-fusion layer. In Figure 11, the absolute weights of the last frame are the greatest, and the absolute weights decrease with the temporal distance from the last frame. From Tables A1 and A2, we can see that the weights have been learned so as to increase the accuracy of Case 3. We conclude that learning the late-fusion weights rarely leads to better generalization to the test data; however, it has the potential to produce a weight distribution that better handles difficult cases. To verify our finding that the handcrafted weights give better results on average, we varied the training of the late-fusion layer in two ways: (1) we randomly initialized the weights according to the method in Reference [37], and (2) we applied a different optimization method, RMSprop [38]. In Tables 5 and 6, these two variants are denoted as "with random initialization" and "with RMSprop", respectively; Tables A3, A4, A7 and A8 show their detailed results. The random initialization typically gives the same results as the fixed initialization, although slight improvements can be found for several cases, and changing the optimizer has a similar effect. Neither variation delivered better results than the handcrafted weights.
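For readers unfamiliar with the second variation: RMSprop scales each weight update by the root of a running average of recent squared gradients, giving every weight its own adaptive step size. A minimal NumPy sketch of a single update step (illustrative only; the actual training uses the optimizer's standard library implementation):

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSprop update: divide the gradient by the root of a running
    average of squared gradients, so each weight is scaled adaptively."""
    cache = decay * cache + (1 - decay) * grad ** 2   # running avg of grad^2
    w = w - lr * grad / (np.sqrt(cache) + eps)        # adaptively scaled step
    return w, cache
```

Because the update magnitude depends on gradient history, RMSprop can escape flat regions differently than plain SGD, which is why it was tried as an alternative here.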

Processing Time
We measured the computation time of the proposed method on a computer running Ubuntu 18.04 with an Intel Xeon 4114 processor, an NVIDIA TITAN Xp graphics card, and 256 GB of RAM; the GPU was used for network inference. Our code was written in Python, and the PyTorch package [39] was used for the implementation. Table 7 shows the average processing time per time step. The best-performing method takes 8.44 ms, which corresponds to 118 fps. Moreover, applying the proposed fusion methods adds only 2–3 ms to the baseline. In our late-fusion method, previous outputs are stored in memory so that they do not have to be recalculated to obtain the current output; thus, the additional computational load is very low.
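The caching scheme described above can be sketched with a fixed-length buffer that keeps the last T_l backbone outputs, so each new time step costs only one forward pass plus a cheap fusion over stored results (a sketch under the same uniform-weight assumption as before; the class and method names are ours):

```python
from collections import deque
import numpy as np

class LateFusionBuffer:
    """Stores the last T_l backbone outputs; the oldest entry is evicted
    automatically, so past outputs are never recomputed."""
    def __init__(self, t_l=16):
        self.outputs = deque(maxlen=t_l)

    def update(self, logits):
        """Add the newest backbone output and return the fused prediction."""
        self.outputs.append(np.asarray(logits, dtype=np.float64))
        fused = np.mean(self.outputs, axis=0)   # uniform late fusion
        return int(np.argmax(fused))
```

The per-step cost of the fusion is O(T_l · num_classes), negligible next to a network forward pass, which is consistent with the 2–3 ms overhead reported in Table 7.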

Conclusions
We proposed two kinds of fusion layers to improve the performance of classification networks on sequential flame-image inputs. The proposed early-fusion layer can learn to extract the power spectral density of subsequent image frames. The proposed late-fusion layer combines the outputs at different time steps to predict the current output. The proposed layers were combined with ResNet architectures to classify the combustion state of hydrogen and methane syngas.
The proposed models for combustion instability classification based on flame images were validated, achieving a high accuracy of 95.1–98.6% and a short processing time of 5.2–12.2 ms.
The accuracy generally increased with T_e, the number of input flame images. With T_e = 16, PS-ResNets showed slightly higher performance than ResNets, and an approximately 3.5% improvement in classification accuracy was obtained by the proposed fusion methods. The best-performing method was the combination of our early-fusion layer, ResNet-18, and our late-fusion layer with handcrafted weights. Our in-depth validation tests yielded interesting results: simply increasing the number of input images is as competitive as using the proposed early-fusion layer, and using handcrafted weights for the late-fusion layer was shown to be more effective than using learned weights. However, the learned weights were shown to have the potential to improve the performance on difficult cases in which the flame images are highly ambiguous.
In this work, we used regular time steps for the two fusion methods; for the late-fusion method, however, there is no theoretical constraint against using irregular time steps. In future research, we will investigate learning the time steps so as to maximize the classification accuracy.

Acknowledgments:
The authors are grateful to Youn Shik Byun for the fruitful discussion on signal processing topics.

Conflicts of Interest:
The authors declare no conflict of interest.