Performance Sensing Data Prediction for an Aircraft Auxiliary Power Unit Using the Optimized Extreme Learning Machine

The aircraft auxiliary power unit (APU) is responsible for environmental control in the cabin and for starting the aircraft's main engines. The prediction of its performance sensing data is significant for condition-based maintenance. As a complex system, its performance sensing data are typically nonlinear. In order to monitor this process, a model with strong nonlinear fitting ability needs to be formulated. Neural networks have advantages in solving nonlinear problems. Compared with the traditional back propagation neural network algorithm, an extreme learning machine (ELM) has the features of a faster learning speed and better generalization performance. To avoid the slow training of neural networks with the back propagation algorithm, an ELM is employed to predict the performance sensing data of the APU in this study. However, the randomly generated weights and thresholds of the ELM often result in unstable predictions. To address this problem, a restricted Boltzmann machine (RBM) is utilized to optimize the ELM. In this way, a stable performance parameter prediction model of the APU can be obtained and better performance parameter prediction results can be achieved. The proposed method is evaluated on real APU sensing data from China Southern Airlines Company Limited Shenyang Maintenance Base. Experimental results show that the ELM optimized with an RBM is more stable and obtains more accurate prediction results.


Introduction
The aircraft auxiliary power unit (APU) is designed to provide power for the aircraft independently [1]. In fact, the core of the APU is a small gas turbine engine providing power and compressed air [2,3]. Before the aircraft takes off, the APU provides power for lighting and air conditioning in the cabin, and also provides the compressed air for starting the main engines of the aircraft. After climbing to a certain height, the APU is shut down. When an emergency situation occurs (e.g., fault of the main engine), the APU can be started again to help restart the main engines. After landing, the APU supplies power for lighting and air conditioning again. In this way, the main engines can be turned off earlier to save fuel and reduce noise and emissions.
Since the ELM was proposed, it has been widely utilized in several fields. In fact, the ELM is a kind of single-hidden layer feed forward neural network [29]. Compared with the traditional back propagation neural network algorithm [30], the ELM has the advantages of a faster learning speed and better generalization performance. In addition, its parameters do not need to be iteratively adjusted during the training process; only the number of neurons in the hidden layer needs to be set to obtain the unique optimal solution. Thus, the ELM has attracted the attention of scholars and has been successfully applied in many fields. Chaturvedi et al. [31] utilized an ELM to perform subjectivity detection. Cao et al. [32] proposed an enhanced ensemble-based ELM and sparse representation classification algorithm to realize image classification, which incorporates multiple ensembles to enhance the reliability of the classifier. Zhang et al. [33] utilized an ELM to improve the performance of the probabilistic semantic model. However, the original ELM often leads to unstable prediction results [34]. In this study, the restricted Boltzmann machine (RBM) is adopted to optimize the ELM [35,36], so that a structurally stable ELM can be obtained. To evaluate the proposed method, real sensing data from China Southern Airlines Company Limited Shenyang Maintenance Base (SYMOB) are adopted to carry out the evaluation experiments. Comparison experiments on three groups of data show that the optimized ELM has more stable and better prediction performance.
The rest of this article is organized as follows. Section 2 introduces the proposed method and related theories. Section 3 presents experimental results and detailed discussion. Section 4 draws conclusions and discusses future work.

Performance Sensing Data Prediction Using the Optimized ELM via the RBM
In this section, the proposed method is first illustrated. Then, the related theories such as the RBM and the ELM are presented in detail.

The Proposed Method
The ELM has the advantages of a faster learning speed and better generalization performance. To deal with the problem of slow training of neural networks with the back propagation algorithm, an ELM was adopted to predict the performance sensing data of the APU in this study. However, the randomly generated weights and thresholds of the ELM often lead to unstable prediction results. To address this problem, an RBM was utilized to optimize the connection weights and thresholds between the input layer and the hidden layer. In this way, a stable performance sensing data prediction model of the APU and better prediction results are expected to be achieved. The proposed method is illustrated in Figure 1.
To conduct the proposed method, the specific procedures are explained as follows.
Step 1. Choose the exhaust gas temperature (EGT) data as the key parameter of the performance sensing data, which are taken from the condition monitoring data of the APU. Preprocess the original data and divide the preprocessed data into training data and test data, respectively.
Step 2. Initiate the related parameters of the RBM and train the RBM with the available data.
Step 3. After the RBM is well trained, the weights and thresholds of the RBM are assigned to the ELM.
Step 4. Utilize the training data to train the ELM. In this way, all parameters of the ELM can be determined.
Step 5. Predict the EGT of the next cycle based on the historical EGT. The predicted EGT data are used as part of the historical EGT. These combined data are utilized to predict the EGT of the next cycle. This process is repeated until the preset prediction steps are met.
Step 6. Evaluate the prediction results by appropriate metrics.
The error function in the ELM can be regarded as an energy function. In general, the optimal solution of the network parameters can be obtained using the generalized inverse algorithm. However, if the weights and thresholds between the input layer and the hidden layer are generated randomly, the ELM network often falls into a local minimum and fails to reach the global minimum. The reason is that the error (or energy) function of the network is a nonlinear surface with multiple minima, and the randomly generated weights and thresholds of the ELM can cause the network to fall into a local minimum.
The main difference between a random network (e.g., the RBM) and the ELM lies in the learning stage. Unlike deterministic networks, a random network does not adjust its weights according to a deterministic algorithm but modifies them according to a probability distribution. In this way, the aforementioned defect can be effectively overcome. The net input of a neuron does not determine whether its state is 1 or 0, but determines the probability that its state is 1 or 0. This is the basic concept of a random neural network algorithm.
For the RBM network, as the network state evolves, the energy of the network always tends to decrease in the sense of probability. This means that although the overall trend of the network energy is downward, some neuron states may, with a small probability, take values that temporarily increase the network energy. It is precisely because of this possibility that the RBM has the ability to jump out of a local minimum trough, which is the fundamental difference between the RBM and the ELM. This behavior is called the search mechanism: as the network runs, it continuously searches for lower energy minima until the global minimum of the energy is reached.
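The iterated prediction loop of Step 5 can be sketched as follows. `predict_one` is a hypothetical stand-in for the trained one-step-ahead model (in the paper, the RBM-optimized ELM); here it is replaced by a toy windowed-mean predictor for illustration only:

```python
import numpy as np

def rolling_forecast(predict_one, history, n_steps, window):
    """Iterated one-step-ahead prediction (Step 5): each predicted EGT
    value is appended to the history and reused as model input for the
    next cycle, until the preset number of prediction steps is met."""
    history = list(history)
    preds = []
    for _ in range(n_steps):
        x = np.array(history[-window:])  # most recent `window` cycles
        y_hat = predict_one(x)           # one-step-ahead prediction
        preds.append(y_hat)
        history.append(y_hat)            # feed the prediction back in
    return preds

# Toy stand-in model: predicts the mean of the input window.
demo = rolling_forecast(lambda x: float(x.mean()),
                        [1.0, 2.0, 3.0], n_steps=2, window=3)
```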


Restricted Boltzmann Machine
The RBM has only two layers of neurons, as shown in Figure 2. The first layer $v = (v_1, v_2, \cdots, v_n)$ is named the visible layer, which consists of visible units for training data input. The other layer $h = (h_1, h_2, \cdots, h_m)$ is named the hidden layer, which consists of hidden units.
If the RBM includes $n$ visible units and $m$ hidden units, the vectors $v$ and $h$ can be used to represent the states of the visible and hidden units, respectively, where $v_i$ denotes the state of unit $i$ in the visible layer and $h_j$ the state of unit $j$ in the hidden layer. For the pair $(v, h)$, the energy of the RBM is defined by

$$E(v, h \mid \theta) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n} \sum_{j=1}^{m} v_i W_{ij} h_j,$$

where $\theta = \{W_{ij}, a_i, b_j\}$ is the parameter set of the RBM, $W_{ij}$ represents the connection weight between visible unit $i$ and hidden unit $j$, $a_i$ is the bias of visible unit $i$, and $b_j$ is the bias of hidden unit $j$. When the parameters are determined, the joint probability distribution of $(v, h)$ can be obtained by

$$P(v, h \mid \theta) = \frac{e^{-E(v, h \mid \theta)}}{Z(\theta)}, \qquad Z(\theta) = \sum_{v, h} e^{-E(v, h \mid \theta)},$$

where $Z(\theta)$ is the normalizing factor (also known as the partition function). The distribution $P(v \mid \theta)$ of the observed data $v$ defined by the RBM is the essential issue; to determine it, the normalizing factor $Z(\theta)$ needs to be calculated. When the states of the visible units are given, the activation states of the hidden units are conditionally independent. At this point, the activation probability of unit $j$ in the hidden layer is

$$P(h_j = 1 \mid v, \theta) = \sigma\Big(b_j + \sum_{i=1}^{n} v_i W_{ij}\Big),$$

where $\sigma(\cdot)$ is the sigmoid activation function. Since the structure of the RBM is symmetric, the activation states of the visible units are likewise conditionally independent when the states of the hidden units are given. The activation probability of unit $i$ in the visible layer is

$$P(v_i = 1 \mid h, \theta) = \sigma\Big(a_i + \sum_{j=1}^{m} W_{ij} h_j\Big).$$

It should be noted that there are no interconnections among the neurons within the visible layer or within the hidden layer; only inter-layer neurons are symmetrically connected, and the units within a layer are conditionally independent:

$$P(h \mid v) = \prod_{j=1}^{m} P(h_j \mid v), \qquad P(v \mid h) = \prod_{i=1}^{n} P(v_i \mid h).$$

With this property, it is not necessary to compute each neuron separately at every step; the neurons of an entire layer can be computed in parallel.
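The conditional activation probabilities above can be sketched in a few lines of NumPy; the layer sizes and weight values below are illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """P(h_j = 1 | v) = sigma(b_j + sum_i v_i W_ij). Hidden units are
    conditionally independent given v, so the whole layer is computed
    in one vectorized step."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """P(v_i = 1 | h) = sigma(a_i + sum_j W_ij h_j), by symmetry."""
    return sigmoid(a + h @ W.T)

rng = np.random.default_rng(0)
n, m = 4, 3                                # illustrative layer sizes
W = rng.normal(scale=0.1, size=(n, m))     # visible-to-hidden weights
a, b = np.zeros(n), np.zeros(m)            # biases
v = np.array([1.0, 0.0, 1.0, 0.0])
ph = p_h_given_v(v, W, b)                  # shape (m,), entries in (0, 1)
```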
The training target of the RBM is to find the parameters that maximize the probability of the training samples. Since the decisive factor lies in the weights $W$, the object of training the RBM is to determine the optimal weights.
The marginal distribution of the joint probability distribution is the likelihood function, as defined by

$$P(v \mid \theta) = \sum_{h} P(v, h \mid \theta) = \frac{1}{Z(\theta)} \sum_{h} e^{-E(v, h \mid \theta)}.$$

When the training data $D$ are given, the goal of training the RBM is to maximize the likelihood

$$L(\theta) = \prod_{v \in D} P(v \mid \theta),$$

which is equivalent to maximizing the log-likelihood $\sum_{v \in D} \ln P(v \mid \theta)$. Stochastic gradient ascent can be applied to solve this problem. The derivative of $\ln P(v)$ with respect to $\theta$ is

$$\frac{\partial \ln P(v)}{\partial \theta} = \Big\langle \frac{\partial (-E(v, h))}{\partial \theta} \Big\rangle_{\text{data}} - \Big\langle \frac{\partial (-E(v, h))}{\partial \theta} \Big\rangle_{\text{model}}.$$

The first term, the expectation under the data distribution, is easily calculated. However, the second term involves the joint distribution $P(v, h)$ of the visible and hidden units, which contains the normalizing factor $Z$; this distribution is difficult to obtain. Therefore, the second term cannot be calculated exactly, and only an approximation can be achieved through sampling methods.
The resulting derivatives for the individual parameters are

$$\frac{\partial \ln P(v)}{\partial W_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}},$$
$$\frac{\partial \ln P(v)}{\partial a_i} = \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}},$$
$$\frac{\partial \ln P(v)}{\partial b_j} = \langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{model}}.$$
Computing the model expectations in the above three equations requires a sum over all $2^{n+m}$ joint states. Therefore, a Markov chain Monte Carlo (MCMC) method, such as Gibbs sampling, is usually adopted, and the collected samples are used to estimate the model expectations. However, each MCMC run requires sufficiently many state transitions to ensure that the collected samples conform to the target distribution, and a large number of samples is needed for an accurate estimate. These requirements greatly increase the complexity of RBM training. In this study, the contrastive divergence (CD) algorithm is adopted to obtain the parameters of the RBM. The k-step CD algorithm is described as follows.
Let the connection weight matrix be W, and let the bias vectors of the visible layer and the hidden layer be represented by a and b, respectively. The CD algorithm is shown in Figure 3. Figure 3. The training process of contrastive divergence (CD).
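A minimal sketch of one CD update, assuming binary units and the gradient estimates above. The paper uses a k-step CD; this is the k = 1 special case, in which the model expectation is replaced by a single Gibbs reconstruction, and all sizes below are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 step: <v_i h_j>_model is approximated by a single Gibbs
    reconstruction, avoiding the exact 2^(n+m) sum over joint states."""
    # Positive phase: hidden probabilities driven by the data.
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
    # Negative phase: reconstruct the visible layer, then hidden probs again.
    pv1 = sigmoid(a + h0 @ W.T)
    ph1 = sigmoid(b + pv1 @ W)
    # Parameter updates approximate the log-likelihood gradient.
    W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    a += lr * (v0 - pv1)
    b += lr * (ph0 - ph1)
    return W, a, b

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4, 3))   # illustrative sizes
a, b = np.zeros(4), np.zeros(3)
W, a, b = cd1_update(np.array([1.0, 0.0, 1.0, 1.0]), W, a, b)
```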
After training, the RBM can accurately extract the features of the visible layer, and based on these features, the hidden layer can help reconstruct the visible layer. As aforementioned, the performance of the original ELM is easily affected by the initialization of its weights and thresholds. In this study, to solve this problem, the RBM was trained first; then, the weights and thresholds of the trained RBM were transmitted to the ELM. In this way, an ELM with better performance can be obtained.

Extreme Learning Machine
To solve the problem of the slow learning speed of traditional feed forward neural networks, Huang et al. [28] proposed a new learning algorithm, which is called the ELM. The structure of the ELM is the same as the traditional single hidden layer neural network, as shown in Figure 4. In Figure 4, the input layer of the ELM contains m neurons, the hidden layer has l neurons, and the output layer contains n neurons.

Let $W_j$ and $b_j$ denote the connection weights and bias between the input layer and hidden neuron $j$. For $Q$ arbitrary samples with inputs $X_q$ and outputs $y_q$, the output of the neural network is given by

$$\sum_{j=1}^{l} \beta_j\, g(W_j X_q + b_j) = y_q, \qquad q = 1, 2, \ldots, Q,$$

where $g(x)$ is the activation function and $\beta_j$ is the output weight of hidden neuron $j$. The input matrix $X$ and output matrix $Y$ of the $Q$ samples are expressed by

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1Q} \\ x_{21} & x_{22} & \cdots & x_{2Q} \\ \vdots & \vdots & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mQ} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1n} \\ y_{21} & y_{22} & \cdots & y_{2n} \\ \vdots & \vdots & & \vdots \\ y_{Q1} & y_{Q2} & \cdots & y_{Qn} \end{bmatrix}.$$
If the output matrix of the hidden layer is defined as $H$, then $H$ can be given by

$$H = \begin{bmatrix} g(W_1 X_1 + b_1) & \cdots & g(W_l X_1 + b_l) \\ \vdots & & \vdots \\ g(W_1 X_Q + b_1) & \cdots & g(W_l X_Q + b_l) \end{bmatrix}_{Q \times l},$$

and the output weights can be collected as $\beta = [\beta_1, \beta_2, \ldots, \beta_l]^{T}$. The network equations can then be written compactly as $H\beta = T$, where $T$ is the matrix of target outputs. Let $(X_q, t_q)$ denote an arbitrary training sample; the goal of the ELM is to minimize the output error

$$\sum_{q=1}^{Q} \| y_q - t_q \| \to 0,$$

which can be transformed into the least-squares problem $\min_{\beta} \| H\beta - T \|$.
By solving this least-squares problem, the output weight $\hat{\beta}$ can be determined by

$$\hat{\beta} = H^{\dagger} T,$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$.
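The closed-form ELM training can be sketched as follows. Here the input weights are drawn randomly, as in the original ELM; in the proposed method they would instead be assigned from the trained RBM. The toy regression task and all sizes are illustrative only:

```python
import numpy as np

def train_elm(X, T, l, rng=np.random.default_rng(0)):
    """Train a single-hidden-layer ELM: the input weights W and biases b
    are fixed (random here), and only the output weights beta are solved
    in closed form via the Moore-Penrose pseudoinverse: beta = pinv(H) @ T."""
    m = X.shape[1]
    W = rng.normal(size=(m, l))              # input-to-hidden weights
    b = rng.normal(size=l)                   # hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T             # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy regression: fit a smooth target and check the training error is small.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(60, 2))
T = np.sin(X[:, 0]) + 0.5 * X[:, 1]
W, b, beta = train_elm(X, T, l=20)
err = np.abs(elm_predict(X, W, b, beta) - T).mean()
```

Because only `beta` is solved for, training reduces to one pseudoinverse computation, which is the source of the ELM's fast learning speed noted in the text.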

Experimental Results and Discussion
To verify the effectiveness of the proposed method, the real APU sensing data from China Southern Airlines Company Limited Shenyang Maintenance Base were utilized. Three evaluation metrics were utilized to evaluate the performance sensing data prediction.

Data Description
The condition monitoring data of the APU were from the aircraft communications addressing and reporting system, which mainly consist of four segments. They are the header, resume information, the operating parameters of the aircraft main engine, and the starting parameters of the APU. The header contains aircraft flight information, message generation, bleed valve status, opening angle, and total temperature. Resume information includes the APU serial number, operation hours, and number of cycles. The operating parameters are comprised of control command, exhaust gas temperature, guide vane opening angle, compressor inlet pressure, load compressor inlet port temperature, bleed air flow, bleed air pressure, oil temperature, and generator load. The starting parameters are made up of start-up interval, exhaust gas temperature peak, peak speed, oil temperature, and inlet temperature. Among the aforementioned parameters, EGT is the key performance parameter that can be utilized to predict APU degradation.

Evaluation Metrics
Let y(i) (i = 1, 2, . . . , N) denote the real measured data and p(i) (i = 1, 2, . . . , N) denote the predicted data, where N indicates the number of predicted steps. The utilized metrics are given as follows.
(1) Mean absolute error (MAE). In statistics, MAE measures how close the predicted data are to the actual data, $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |y(i) - p(i)|$. A smaller value of MAE indicates better accuracy of the prediction model.
(2) Mean absolute percent error (MAPE). MAPE is adopted to give an intuitive interpretation in terms of relative error, $\mathrm{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N} \left| \frac{y(i) - p(i)}{y(i)} \right|$.
(3) Root mean square error (RMSE). RMSE is the square root of the mean squared error, $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y(i) - p(i))^2}$. A smaller RMSE value denotes better stability of the prediction model.
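The three metrics can be computed directly from their definitions; the EGT values below are illustrative, not taken from the paper's datasets:

```python
import numpy as np

def mae(y, p):
    """Mean absolute error: average of |y(i) - p(i)|."""
    return float(np.mean(np.abs(y - p)))

def mape(y, p):
    """Mean absolute percent error, in percent (assumes y(i) != 0)."""
    return float(100.0 * np.mean(np.abs((y - p) / y)))

def rmse(y, p):
    """Root mean square error: square root of the mean squared error."""
    return float(np.sqrt(np.mean((y - p) ** 2)))

y = np.array([500.0, 510.0, 520.0])   # measured EGT (illustrative values)
p = np.array([498.0, 512.0, 525.0])   # predicted EGT
```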

Experimental Results
In this section, the comparison experiments between the original ELM and the proposed method are carried out. To fully measure the effectiveness of the proposed method, three groups of experiments with different training data were conducted.
The number of training data points varies from 240 to 320, while the number of test data points varies from 21 to 41 for prediction accuracy assessment. The details of these three datasets are shown in Table 1.

Experiments Implemented with the ELM
The utilized ELM in these experiments had one hidden layer consisting of 20 neurons. The ELM was first trained with the training data, and the trained ELM was then adopted for prediction. The prediction results of the first group of data are shown in Figure 5. The experiment was conducted 10 times in total, and the results of the first two runs are shown in Figure 5: the dotted line represents the actual measured EGT, and the starred line represents the predicted EGT. It can be seen intuitively that the prediction results of the ELM are unstable; in addition, the prediction deviation becomes relatively larger as the number of prediction steps increases. The detailed evaluation metrics of ELM prediction on the first group of data are illustrated in Figure 6. As shown in the charts, each experiment produces a unique prediction result.
To further explore the prediction performance of the ELM, experiments were conducted 10, 20, 50, 100, and 200 times, respectively. The details of the experimental results are listed in Table 2. Experimental results with the data of the second group are shown in Figures 7 and 8; the experimental setting conditions are the same as in the previous experiments.

Experiments Implemented with the Proposed Method
The prediction results of the proposed method are shown in Figures 11-13. Figure 11. Experimental results of the proposed method using the first group of data. The two curves in Figures 11-13 have the same meaning as those in Figures 5, 7 and 9. It can be seen that the two curves are close to each other, which indicates that the predicted EGT data are more precise. As shown in Section 3.3.1, there is no evidence that more experiments make better predictions on average. Thus, in the comparison experiments, the average results over 10 runs are adopted. The results of the comparison experiments are given in Tables 5-7.

Discussion
Since the prediction results of the ELM were not stable, the ELM experiments were conducted 10, 20, 50, 100, and 200 times, respectively, in each group. The average results over 10 runs were used for the comparison experiments. ELM prediction results for each group of experiments are shown in Figures 5, 7 and 9, respectively. It can be seen intuitively from the figures that the randomly initialized ELM produces unstable prediction results and also fails to predict the degradation trend of the APU.
As can be seen from Figures 6, 8 and 10, the evaluation indexes of each experiment are different. The experiments also confirm that the random initialization of weights and thresholds will lead to unstable ELM prediction results.
Tables 2-4 give the evaluation metrics of each group under different numbers of experiments. This averaging is essentially a form of ensemble learning, by which more stable prediction results can be obtained. In Table 2, the worst prediction results occur when the number of experiments is 50, with MAE, MAPE, and RMSE of 10.2111, 1.8165, and 11.8673, respectively; the best results, with MAE, MAPE, and RMSE of 9.2661, 1.6488, and 10.7172, are reported at an experiment number of 50. In Table 3, the worst results occur at 10 experiments, with MAE, MAPE, and RMSE of 14.8342, 2.5747, and 16.4750, and the best at 20 experiments, with 14.0302, 2.4360, and 15.6901, respectively. In Table 4, the worst results occur at 100 experiments, with MAE, MAPE, and RMSE of 11.4609, 1.9621, and 13.1249, and the best at 20 experiments, with 8.9778, 1.5366, and 10.2694, respectively. These results show that increasing the number of experiments does not necessarily improve the prediction, because the prediction results of the ELM are not stable.
As shown in Tables 5-7, although the experimental results of the original ELM have been ensembled and the averaged values are adopted, the values of the three metrics are all larger than those of the proposed method. By using the proposed method, MAE, MAPE, and RMSE declined to 35.7%, 36.1%, and 45.7% of their initial values in Group 1, respectively. In the second group, MAE, MAPE, and RMSE declined to 27.5%, 27.6%, and 28.1% of their initial values, respectively. In the third group, MAE, MAPE, and RMSE declined to 21.9%, 21.9%, and 23.8% of their initial values, respectively. Therefore, compared with the original ELM, the proposed method obtains better performance sensing data prediction; specifically, its accuracy, relative error, and stability are all better.

Conclusions
In this article, we studied the performance sensing data prediction of an aircraft APU. To exploit the nonlinear features contained in these data, an ELM optimized by an RBM was proposed. In this way, relatively appropriate weights and thresholds of the ELM for prediction can be determined. Three groups of evaluation experiments were implemented with different lengths of training and test data. Compared with the original ELM, the proposed method achieves better accuracy and more stable prediction results. Therefore, the proposed method not only provides a feasible way to predict the performance sensing data that serve as an indicator of the health condition of the APU, but also offers an idea for improving the ELM. However, the influence of environmental factors (e.g., ambient temperature and atmospheric pressure) on the EGT sensing data was not considered. In future work, we will incorporate the physical model of bleed air performance to correct the EGT data, which is expected to further enhance the prediction results.