Improved LS-SVM Method for Flight Data Fitting of Civil Aircraft Flying at High Plateau

Abstract: High-plateau flight safety is an important research topic in civil aviation transportation safety science. Complete and accurate high-plateau flight data help to assess and improve the flight status of civil aircraft, and play an important role in high-plateau operational safety risk analysis. For various reasons, such as the low temperature and low pressure of the harsh high-plateau environment, quick access recorder (QAR) data can be abnormal or lost, which affects flight data processing and analysis results to a certain extent. To solve this problem, an improved least squares support vector machine (LS-SVM) method is proposed. Firstly, the entropy weight method is used to obtain the index weights. Secondly, principal component analysis is used for dimensionality reduction. Finally, the data are fitted and repaired with the LS-SVM after appropriate feature quantities have been selected through multiple tests. To verify the effectiveness of this method, QAR data from multiple real plateau flights are used for testing and comparison. The fitting results show that the accuracy indicated by the mean absolute error (MAE) is more than 90% and that the equal coefficient (EC) reaches a high degree of fit of 0.99, which shows that the improved LS-SVM machine learning model can fit and supplement missing QAR data in plateau areas from historical flight data and can effectively meet application needs.


Introduction
High-plateau flight is an important safety issue for civil aviation, especially for China's civil aviation transportation. High-plateau airports are mainly distributed in China, Nepal, Peru, Bolivia, Ecuador, and other countries. Among the 42 high-plateau airports in the world, 16 are located in China, so their operational safety has a profound impact on China's civil aviation [1]. A typical unsafe event occurred on 14 May 2018 on Sichuan Airlines flight 3U8633 from Chongqing to Lhasa: the front windshield of the cockpit burst and fell off during flight in high-plateau airspace, and the crew made an emergency descent. Compared with ordinary flight, high-plateau flight involves low air density and atmospheric pressure, complex terrain, solar radiation, uneven heating of the terrain facing the sun, and many other environmental characteristics, which result in stricter takeoff and landing conditions for aircraft on high plateaus. The technical requirements for personnel are more stringent, and factors such as modifications made to ordinary civil aircraft cause the flight parameters of high-plateau civil airliners to differ from those of civil airliners on general routes. During the entire flight, the quick access recorder (QAR) data may be abnormal or lost due to the harsh high-plateau environment, detection equipment, transmission equipment, or other unknown conditions. The QAR is an important data source for post-flight technical analysis, engine health analysis, flight safety incident investigation, flight quality analysis, operational quality analysis, and aircraft health management. Abnormal QAR data therefore bring inconvenience and hidden hazards to monitoring and analyzing the safety status of high-plateau flights and to the related theoretical research.
Many scholars have carried out fruitful research on flight data analysis and application, mainly focusing on flight data processing and flight data application. Flight data have many applications in aviation operation safety research [2][3][4][5][6]. Some scholars have applied flight data to turbine fault diagnosis, general aviation anomaly detection, and the prediction of key landing indices for aviation safety [7][8][9][10][11], as well as to the integrated design process of man-machine systems for tower flight data managers and to new methods for nonlinear aerodynamic modeling from flight data [12][13][14]. Other studies cover the analysis of flight characteristics in QAR data for landings at high-altitude airports, machine learning methods for airline flight data monitoring, the generation of new operational safety knowledge from existing data, safety science insights gained from black boxes to flight data monitoring, and fault diagnosis of rolling bearings and rolling elements using optimized MCKD with sparse representation and VMD with sensitivity-based MCKD [15][16][17][18][19]. Some scholars have also studied the impact of the leveling operation on landing safety based on variance analysis of real flight data, civil aircraft hazard identification and prediction based on deep learning [20,21], unsteady aerodynamic modeling of unstable dynamic processes [22], and small-sample inspection data-driven diagnosis of critical deviation sources in aircraft structural assembly [23].
In the research of flight data processing methods and technologies, many scholars have also carried out a series of studies [24][25][26]. Some scholars have proposed improved binary gray wolf optimizer and support vector machine methods, arithmetic optimization algorithms, particle swarm optimization, average impact value-support vector machine algorithms, etc., for in-flight data processing and optimization [27][28][29]. Some scholars combined multiple classifiers to quantitatively sort the impact of anomalies in flight data based on frequency domain specification and improved particle swarm optimization algorithms, as well as enhanced fast non-dominated solution sorting genetic algorithms for multi-objective problems research [30][31][32].
In short, many scholars have carried out a series of studies on flight data collection and analysis, as well as on application methods and technologies, and have achieved many valuable results. However, research on high-altitude flight data is rare, especially research on filling in and simulating flight data lost because of high altitude, low temperature, low pressure, and other elements of this special operating environment. To solve the problem of high-plateau QAR flight data padding, an improved least squares support vector machine (LS-SVM) method is proposed. The entropy weight method is used to obtain the index weights, and principal component analysis is used for dimensionality reduction. The flight data are then fitted and repaired with the LS-SVM after appropriate feature quantities have been selected through multiple tests. To verify the effectiveness of this method, QAR data from multiple real plateau flights are used for testing and comparison.

LS-SVM Principle
The support vector machine (SVM) is a generalized linear classifier that performs binary classification of data in a supervised learning manner. Its decision boundary is the maximum-margin hyperplane solved for the learning samples. The basic principle is shown in Figure 1. It is a machine learning method that is based on a complete statistical learning theory and has excellent learning capabilities. It has strict mathematical theory support and strong interpretability, and it does not rely on statistical methods, thus simplifying the usual problems of classification and regression. It can also find the key samples (support vectors) that are critical to the task. With the kernel technique, it can handle non-linear classification and regression tasks. The final decision function is determined by only a small number of support vectors, and the complexity of the calculation depends on the number of support vectors, not on the dimensionality of the sample space.
The LS-SVM is an improvement on the standard support vector machine, proposed as a new type of support vector machine by Suykens and Vandewalle. Compared with the standard SVM, it replaces the inequality constraints of the SVM with equality constraints, which increases the convergence speed, improves classification accuracy in problems with desired goals, and achieves good results [33].
Suppose the training data set of a given LS-SVM is expressed as

$$D = \{(x_i, y_i)\}_{i=1}^{N}, \qquad (1)$$

where $x_i \in R^n$ is the n-dimensional system input vector, $y_i \in R$ is the system output, and $f(x)$ is the unknown function to be estimated. Introducing a nonlinear mapping $\varphi: R^n \to H$, where $\varphi$ is called the feature map and $H$ is the feature space, the unknown function is estimated by a function of the form

$$f(x) = \omega^T \varphi(x) + b. \qquad (2)$$
Here, $\omega$ is the weight vector in the feature space $H$, and $b \in R$ is the bias. The SVM algorithm uses a kernel function defined on the original space to replace the dot product in the high-dimensional feature space, which avoids complex operations, and uses structural risk minimization as its learning rule, which is mathematically described as $\omega^T \omega \le$ constant. The standard SVM algorithm takes the ε-insensitive loss function for the structural risk minimization problem. The meaning of the ε-insensitive loss function is as follows: when the difference between the observed value $y$ at the point $x$ and the predicted value $f(x)$ does not exceed the predetermined ε, the predicted value $f(x)$ at this point is considered lossless, although $f(x)$ and $y$ may not be equal. The LS-SVM, on the other hand, chooses the squared error $e_i^2$ (the squared 2-norm of the slack variable $\xi_i$) as the loss function and requires the constraint to hold with equality. Therefore, the optimization problem is established as

$$\min_{\omega, b, e} J(\omega, e) = \frac{1}{2} \omega^T \omega + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \qquad (3)$$

$$\text{s.t.} \quad y_i = \omega^T \varphi(x_i) + b + e_i, \quad i = 1, \ldots, N. \qquad (4)$$
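Here, $\gamma$ is a real constant which determines the relative size of $\frac{1}{2} \omega^T \omega$ and $\frac{1}{2} \sum_{i=1}^{N} e_i^2$. It sets the compromise between the training error and the model complexity so that the function can achieve better generalization ability. The LS-SVM algorithm defines a loss function that is different from that of the standard SVM algorithm and changes its inequality constraints to equality constraints, so that $\omega$ can be obtained in the dual space. The Lagrange function is

$$L(\omega, b, e, \alpha) = J(\omega, e) - \sum_{i=1}^{N} \alpha_i \left[ \omega^T \varphi(x_i) + b + e_i - y_i \right], \qquad (5)$$

where $\alpha_i \in R$ is the Lagrange multiplier (because the constraints are equalities, its sign is not restricted). The optimality conditions are

$$\frac{\partial L}{\partial \omega} = 0 \Rightarrow \omega = \sum_{i=1}^{N} \alpha_i \varphi(x_i), \quad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{N} \alpha_i = 0, \quad \frac{\partial L}{\partial e_i} = 0 \Rightarrow \alpha_i = \gamma e_i, \quad \frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow \omega^T \varphi(x_i) + b + e_i - y_i = 0. \qquad (6)$$

After eliminating $\omega$ and $e_i$ from Equation (6), the optimization problem is transformed into solving the following linear system:

$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & B + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$$

Here, $y = [y_1, y_2, \ldots, y_N]^T$, $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]^T$, $\mathbf{1} = [1, \ldots, 1]^T$, and $B$ is a square matrix whose element in row $i$ and column $j$ is $B_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j)$, where $K(x_i, x_j)$ is the kernel function. With $\omega$ given by the first condition in (6), the nonlinear approximation of the training data set is obtained as

$$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b.$$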
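As an illustration of the dual formulation, the linear system in $b$ and $\alpha$ can be assembled and solved directly. The following is a minimal NumPy sketch, not the authors' implementation: the function names `rbf_kernel`, `lssvm_fit`, and `lssvm_predict` are hypothetical, and the Gaussian kernel is assumed to take the form $\exp(-\|x - z\|^2 / (2\sigma^2))$.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """Kernel matrix K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * sigma2)) (assumed form)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma2))

def lssvm_fit(X, y, gamma, sigma2):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y] for b and alpha."""
    n = X.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                      # first row: sum of alpha equals zero
    A[1:, 0] = 1.0                      # bias column
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]              # b, alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma2):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + b at the rows of X_new."""
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b

# Toy usage on synthetic data (not QAR data).
X_demo = np.linspace(0.0, 2.0 * np.pi, 40)[:, None]
y_demo = np.sin(X_demo[:, 0])
b_demo, alpha_demo = lssvm_fit(X_demo, y_demo, gamma=100.0, sigma2=0.5)
y_fit = lssvm_predict(X_demo, b_demo, alpha_demo, X_demo, sigma2=0.5)
```

Because the constraints are equalities, every training sample receives a multiplier, so the system has N + 1 unknowns; a direct solve is therefore practical only for moderate N.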

The Choice of Kernel Function
The kernel function is used to avoid the particularly complex high-dimensional operations that arise when a non-linear transformation maps the input space to a high-dimensional space. The support vector machine only needs the inner product operation, so if a function of the low-dimensional inputs can be found that is exactly equal to the inner product in the high-dimensional space, the result can be obtained directly and the complicated operations are avoided. The choice of the kernel function requires Mercer's theorem to be satisfied, that is, any Gram matrix of the kernel function in the sample space is a positive semi-definite matrix [34]. Currently, the commonly used kernel functions in research and practice, written in their usual forms, are as follows:
(1) Linear kernel function: $K(x, x_i) = x^T x_i$
(2) Polynomial kernel function: $K(x, x_i) = (x^T x_i + 1)^d$ ($d$ is the order of the polynomial)
(3) Radial basis kernel function: $K(x, x_i) = \exp(-\|x - x_i\|^2 / (2\sigma^2))$
(4) B-spline kernel function: $K(x, x_i) = B_{2n+1}(x - x_i)$
(5) Perceptron (sigmoid) kernel function: $K(x, x_i) = \tanh(\beta x^T x_i + \theta)$
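The Mercer condition can be checked numerically on a sample by building the Gram matrix and inspecting its eigenvalues. The sketch below is illustrative only: the kernel forms are the standard textbook ones (assumed here, since the source does not fix the exact parameterization), and `gram_matrix` and `is_psd` are hypothetical helper names.

```python
import numpy as np

def linear_kernel(x, z):
    """K(x, z) = x^T z."""
    return float(np.dot(x, z))

def polynomial_kernel(x, z, d=3, c=1.0):
    """K(x, z) = (x^T z + c)^d; the offset c = 1 is an assumption."""
    return float((np.dot(x, z) + c) ** d)

def rbf_kernel(x, z, sigma2=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma2)) (assumed parameterization)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma2)))

def gram_matrix(kernel, X, **params):
    """Gram matrix G[i, j] = kernel(X[i], X[j]) for the sample rows of X."""
    n = len(X)
    return np.array([[kernel(X[i], X[j], **params) for j in range(n)]
                     for i in range(n)])

def is_psd(G, tol=1e-9):
    """True if G is symmetric and has no eigenvalue below -tol."""
    return bool(np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol)

# Toy usage on random points (not QAR data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(5, 3))
checks = {
    "linear": is_psd(gram_matrix(linear_kernel, X_demo)),
    "polynomial": is_psd(gram_matrix(polynomial_kernel, X_demo, d=2)),
    "rbf": is_psd(gram_matrix(rbf_kernel, X_demo, sigma2=0.6)),
}
```

A check on one sample cannot prove that a kernel satisfies Mercer's theorem, but a clearly negative eigenvalue is enough to show that it does not.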

Entropy Weight Method (EWM) Principle
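Entropy comes from physical thermodynamics and is one of the parameters that can characterize the state of matter. It was first introduced into information theory by C.E. Shannon and called information entropy. The entropy weight method (EWM) abstracts information and measures the degree of variation of each feature value. In this way, the weight of each feature is calculated and modified to achieve a more reasonable weight index [35]. The specific process is as follows:
(1) Perform data standardization processing on each feature value. Suppose that $k$ feature quantities $X_1, X_2, \ldots, X_k$ are given, where $X_j = \{x_{1j}, x_{2j}, \ldots, x_{nj}\}$. The standardized value of each feature is

$$Y_{ij} = \frac{x_{ij} - \min(X_j)}{\max(X_j) - \min(X_j)}$$

(2) Find the information entropy of each feature value. According to the definition of information entropy in information theory, the information entropy of a set of data can be written as

$$E_j = -\frac{1}{\ln n} \sum_{i=1}^{n} P_{ij} \ln P_{ij}, \qquad P_{ij} = \frac{Y_{ij}}{\sum_{i=1}^{n} Y_{ij}},$$

where $P_{ij} \ln P_{ij}$ is taken to be 0 if $P_{ij} = 0$.
(3) Determine the weight $w_j$ of each feature quantity:

$$w_j = \frac{1 - E_j}{k - \sum_{j=1}^{k} E_j}$$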
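The three steps of the entropy weight method can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: min-max standardization is assumed, the function name `entropy_weights` is hypothetical, and constant columns are assigned entropy 1 (zero weight) as a design choice of this sketch.

```python
import numpy as np

def entropy_weights(X):
    """Entropy weights for the columns (features) of an n-by-k data matrix."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    xmin = X.min(axis=0)
    span = X.max(axis=0) - xmin
    constant = span == 0                      # columns with no variation
    Y = (X - xmin) / np.where(constant, 1.0, span)
    col_sum = Y.sum(axis=0)
    P = Y / np.where(col_sum == 0, 1.0, col_sum)
    # p * ln(p), using the convention 0 * ln(0) = 0
    plogp = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0)), 0.0)
    E = -plogp.sum(axis=0) / np.log(n)
    E[constant] = 1.0                         # assumption: constant feature gets zero weight
    d = 1.0 - E                               # degree of diversification
    return d / d.sum()                        # assumes at least one non-constant column

# Toy usage: the second column is far more concentrated than the first,
# so it has lower entropy and receives the larger weight.
X_demo = np.array([[1.0, 0.0],
                   [2.0, 0.0],
                   [3.0, 0.0],
                   [4.0, 10.0]])
w_demo = entropy_weights(X_demo)
```

The weights sum to 1 by construction, so they can be ranked directly to pick the features with the largest weights.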

Principles of Principal Component Analysis (PCA)
The principal component analysis (PCA) method is currently one of the most widely used data dimensionality reduction algorithms. It sequentially finds a set of mutually orthogonal coordinate axes in the original high-dimensional space. The variance of the original data along each new coordinate axis is used to judge its degree of correlation, and feature quantities with zero or low correlation are excluded to reduce the dimensionality of the data features. Because PCA processes high-dimensional data sets efficiently and simply, it is widely used in practice in various fields, especially in data compression [36].
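A compact way to see the procedure is PCA via the singular value decomposition of the centered data. The sketch below is illustrative: `pca_fit` and `pca_transform` are hypothetical names, and keeping enough components to explain 95% of the variance is an assumed stopping rule that the source does not specify.

```python
import numpy as np

def pca_fit(X, var_ratio=0.95):
    """Return the mean, the leading principal axes, and their explained-variance ratios."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of Vt are the mutually orthogonal principal axes, ordered by variance.
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # Smallest number of axes whose cumulative variance ratio reaches var_ratio.
    m = int(np.searchsorted(np.cumsum(explained), var_ratio) + 1)
    m = min(m, len(s))
    return mean, Vt[:m], explained[:m]

def pca_transform(X, mean, components):
    """Project the data onto the retained principal axes."""
    return (np.asarray(X, dtype=float) - mean) @ components.T

# Toy usage: the third column is the sum of the first two, so two axes suffice.
rng = np.random.default_rng(0)
A_demo = rng.normal(size=(50, 2))
X_demo = np.column_stack([A_demo[:, 0], A_demo[:, 1], A_demo[:, 0] + A_demo[:, 1]])
mean_demo, comps_demo, ratio_demo = pca_fit(X_demo, var_ratio=0.999)
Z_demo = pca_transform(X_demo, mean_demo, comps_demo)
```

In practice the columns would usually be scaled to comparable ranges first, since PCA is sensitive to the units of each feature.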

Verification Method
In order to judge the suitability of the selected number of feature quantities, the coefficient of determination ($R^2$) is introduced. The coefficient of determination indicates how much of the fluctuation of the dependent variable can be described by the fluctuation of the independent variable. Its expression is as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

Here, $y$ and $\hat{y}$ represent the actual value and the predicted value of the simulation result, and $\bar{y}$ is the mean of the actual values. The closer the $R^2$ value is to 1, the better the correlation between the two.
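For reference, the coefficient of determination can be computed as follows. This is a small sketch using the standard definition (one minus the ratio of the residual sum of squares to the total sum of squares); the function name `r2_score` is chosen here for illustration.

```python
import numpy as np

def r2_score(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares (assumed non-zero)
    return 1.0 - ss_res / ss_tot

# Toy usage with made-up numbers (not QAR data).
r2_demo = r2_score([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

A model that always predicts the mean of the actual values scores 0, and a perfect prediction scores 1.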
For the evaluation of the complementation results, four commonly used indicators for data repair are introduced: mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), and equal coefficient (EC). They are calculated as follows:

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \qquad RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2},$$

$$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|, \qquad EC = 1 - \frac{\sqrt{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}}{\sqrt{\sum_{i=1}^{N} y_i^2} + \sqrt{\sum_{i=1}^{N} \hat{y}_i^2}}$$

Here, $y$ and $\hat{y}$ still represent the actual value and the predicted value of the simulation result, and $N$ represents the number of samples in the training set. The smaller the MSE, the more accurately the machine learning simulation results describe the experimental data. EC indicates the degree of fit between the output value and the true value. Generally, any value above 0.9 indicates a good fit.
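The four indicators can be computed together. The sketch below assumes the usual definitions of MSE, RMSE, and MAE, and assumes that the equal coefficient is defined as one minus the root of the summed squared errors divided by the sum of the root sums of squares of the actual and predicted values; the function name `repair_metrics` is hypothetical.

```python
import numpy as np

def repair_metrics(y, y_hat):
    """Return MSE, RMSE, MAE, and EC for actual values y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    err = y - y_hat
    mse = float(np.mean(err ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(err)))
    # Equal coefficient (assumed definition); values near 1 indicate a close fit.
    ec = 1.0 - float(np.sqrt(np.sum(err ** 2))
                     / (np.sqrt(np.sum(y ** 2)) + np.sqrt(np.sum(y_hat ** 2))))
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "EC": ec}

# Toy usage with made-up numbers (not QAR data).
metrics_demo = repair_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

MSE, RMSE, and MAE are expressed in the units of the parameter being repaired (squared for MSE), whereas EC is dimensionless, which makes it easier to compare across QAR parameters of different magnitude.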

Compensation Model and Simulation of High-Plateau Missing Data
The pseudo-compensation of missing data by other QAR data is essentially based on the existence of a certain functional relationship between QAR parameters. The value of the parameter can be derived from other parameter values. Therefore, the purpose of simulation is to determine this functional relationship. To be more specific, high-plateau flight data padding is essentially a function approximation problem.
This paper treats some flight parameters of the QAR flight data as assumed missing data in order to show the feasibility of this method. According to the actual meaning of the QAR data, the lost parameter $N(\tau) = N_{real}$ is set as the missing QAR parameter, where $\tau$ is the moment at which the data are missing, and the other intact QAR parameters are used as the known vector set, following the previous setting. The task is to find a functional relationship between the two, or its first approximation, such that $N(\tau) = N_{real}$. The relationship model can be written as

$$N(\tau) = \omega_\tau^T \varphi(x, t) + b, \qquad (8)$$

where the parameter requirements are the same as above, so the LS-SVM can be used to complement the lost QAR parameters.

Data Selection
In order to verify the feasibility of high-plateau QAR data patching, this paper collects data from ten flights of a certain airline's civil transport aircraft, flown in the same time period with the same origin and destination, for simulation analysis. To reduce interference from irrelevant external factors, the data selection controls for possibly related variables, such as changes in crew members and whether the data are pre-flight or post-flight, so as to improve the accuracy of the simulation. After selection, nine groups were randomly chosen as the model training group, and the remaining group was used as the comparison group to test the accuracy of the experimental results.

Algorithm Improvement
Based on the support vector machine algorithm, an improved method is proposed to address its difficulty in training on and analyzing large-scale samples. Defining the range of the feature values plays a very important role in training: the inputs and outputs are scaled into a small range and then predicted by the support vector machine model. On the one hand, this avoids the overfitting caused by large-valued data dominating small-valued data. On the other hand, scaling the data to a small range can avoid the "curse of dimensionality" and reduce the computational load. Principal component analysis, as a commonly used dimensionality reduction algorithm, simplifies and refines complex data. Combined with processing the data through the entropy weight method, it completes the algorithm optimization and yields concise and accurate data while preserving the robustness of the data.

Algorithm Flow
Before the simulation starts, it is necessary to determine the key parameter $\gamma$ and the kernel width $\sigma^2$ in advance and then use the above algorithm to perform simulation training to fill in the missing data. The specific details and steps are shown in Figure 2.
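The source does not state how the two parameters are chosen in advance. One common option, shown here purely as an assumption, is a grid search that fits the LS-SVM for each candidate pair and keeps the pair with the lowest error on held-out data. The names `rbf`, `fit`, and `grid_search` are hypothetical, and the Gaussian kernel form is assumed.

```python
import numpy as np

def rbf(A, B, sigma2):
    """Kernel matrix exp(-||a - b||^2 / (2 * sigma2)) between rows of A and B (assumed form)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma2))

def fit(X, y, gamma, sigma2):
    """Solve the LS-SVM linear system for the bias b and the multipliers alpha."""
    n = len(y)
    A = np.block([[np.zeros((1, 1)), np.ones((1, n))],
                  [np.ones((n, 1)), rbf(X, X, sigma2) + np.eye(n) / gamma]])
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]

def grid_search(X_tr, y_tr, X_val, y_val, gammas, sigma2s):
    """Return (validation MSE, gamma, sigma2) for the best candidate pair."""
    best = None
    for g in gammas:
        for s2 in sigma2s:
            b, alpha = fit(X_tr, y_tr, g, s2)
            pred = rbf(X_val, X_tr, s2) @ alpha + b
            mse = float(np.mean((y_val - pred) ** 2))
            if best is None or mse < best[0]:
                best = (mse, g, s2)
    return best

# Toy usage on synthetic data (not QAR data); the candidate grids are arbitrary.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 6.0, size=(60, 1))
y_demo = np.sin(X_demo[:, 0])
best_demo = grid_search(X_demo[:40], y_demo[:40], X_demo[40:], y_demo[40:],
                        gammas=[1.0, 10.0, 100.0], sigma2s=[0.1, 0.5, 2.0])
```

Selecting the parameters on held-out data rather than on the training error guards against picking a very small kernel width that only memorizes the training flights.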

Simulation Application
The QAR data cannot be analyzed as a whole because they contain text items; 78 data items remain after all text items are excluded. Python is used as the implementation environment. The weight of each item is measured through the EWM method, and the weights are divided into intervals to select the data items for simulation training. After multiple rounds of testing, the coefficients of determination are compared. It is found that the coefficient of fit becomes larger as the number of feature quantities decreases, and the increase is stable, but models with too few features are prone to overfitting. Weighing the two, the 17 feature items with the largest weights are selected, which gives good accuracy and credibility. The relationship between the number of features and the accuracy rate, as well as the weight ratio of the feature quantities, are shown in Figure 3.
Compared with the algorithm without the improvement, the improved algorithm not only improves the fitting effect but also greatly reduces the amount of data in the simulation. The fitting coefficient is increased by 0.64%, while the amount of data to be computed is reduced by 78.21%. The details are shown in Table 1. The selected feature quantities and the corresponding weights are shown in Table 2 and Figure 4. Among them, the feature that has the greatest impact on the prediction is the true airspeed (TAS), and the feature that has the least impact is the right engine speed (N2_1).
After the feature quantities are selected, because of the large amplitude of the QAR data, the input data and the expected data are normalized on [−1, 0] and [0, 1], respectively, in order to reduce the modeling error. The results are returned to the original interval after analysis. In this paper, the most commonly used radial basis function is selected as the kernel function for data repair:

$$K(x, x_i) = \exp(-\|x - x_i\|^2 / (2\sigma^2))$$

The simulation found that the parameter $\gamma$ and the kernel width $\sigma^2$ have a significant impact on the complementation effect and need to be determined according to the specific characteristics of the training data. Generally speaking, a reduction in the kernel width $\sigma^2$ can improve the training accuracy but can reduce the generalization ability, and an increase in the parameter $\gamma$ can also improve the training accuracy. In training, the missing data were filled in with the parameter $\gamma = 3$ and the kernel width $\sigma^2 = 0.6$.
By observing the fitting images, it is found that the degree of data fitting for each factor and each stage is relatively good, so further analysis of the simulation results can be carried out.
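The normalization step and the return to the original interval can be sketched as a min-max scaler with an explicit inverse. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the target interval is passed as parameters, and the scaling constants are assumed to be learned from the training data only.

```python
import numpy as np

def minmax_fit(X, lo=0.0, hi=1.0):
    """Learn per-column scaling constants that map the data onto [lo, hi]."""
    X = np.asarray(X, dtype=float)
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    span = np.where(xmax > xmin, xmax - xmin, 1.0)   # guard against constant columns
    return {"xmin": xmin, "span": span, "lo": lo, "hi": hi}

def minmax_apply(X, p):
    """Scale data using previously learned constants."""
    return p["lo"] + (np.asarray(X, dtype=float) - p["xmin"]) / p["span"] * (p["hi"] - p["lo"])

def minmax_invert(Z, p):
    """Map scaled values back to the original interval."""
    return p["xmin"] + (np.asarray(Z, dtype=float) - p["lo"]) / (p["hi"] - p["lo"]) * p["span"]

# Toy usage with made-up numbers (not QAR data): scale, then restore.
X_demo = np.array([[10.0, 200.0],
                   [20.0, 400.0],
                   [30.0, 300.0]])
params_demo = minmax_fit(X_demo, lo=-1.0, hi=0.0)
Z_demo = minmax_apply(X_demo, params_demo)
X_back = minmax_invert(Z_demo, params_demo)
```

Keeping the constants in a separate object means the same scaling can be applied to the comparison flight and the predictions can be mapped back to physical units before the error indicators are computed.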

Simulation and Discussion
The experimental results are analyzed through simulation, and the error indicators of the complementation results are shown in Table 3. The MAE values in the table indicate that the average accuracy is more than 90%, and the EC values in the table reach a high degree of fit of 0.99. It can be seen that using the QAR data items as features, assigning weights through EWM, reducing the dimensionality with PCA, and finally applying the LS-SVM algorithm fills in the missing QAR data to great effect. Moreover, since most routes are flown repeatedly by the aircraft, the same method can be applied to historical data to restore the lost flight data when multiple items or all items are lost.

Conclusions
Previous data processing work has mainly used the QAR data to detect changes in the airframe, the environment, and other actual conditions. Few studies have been conducted on the preservation and restoration of the QAR data themselves. This work provides some ideas in this regard. In this paper, the improved LS-SVM method based on the entropy weight method (EWM) and principal component analysis (PCA) is shown to fit the missing QAR data effectively. The parameters gradually stabilize during the training process, so the model can be applied directly for data fitting without retraining, which makes it fast and simple to use. This article only considers the loss of a single item. Since most aircraft fly the same routes repeatedly, the same method can be applied to historical data to restore the lost flight data when multiple items or all items are lost.
Because high-plateau flight is unique, it may differ from flight on normal routes, and the same conclusion may not apply to normal flights. The practical applicability of the method remains to be studied further.
Author Contributions: Conceptualization, N.C. and Y.S.; data curation, N.C. and Z.W.; methodology, N.C. and C.P.; formal analysis, N.C. and Z.W.; writing-original draft preparation, N.C. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The data used to support the findings of this study are included within the article.