Prediction of In-Cylinder Pressure of Diesel Engine Based on Extreme Gradient Boosting and Sparrow Search Algorithm

Abstract: In-cylinder pressure is one of the most important references in the process of diesel engine performance optimization. In order to acquire effective in-cylinder pressure values, many physical tests are required. The cost of physical testing is high, various uncertain factors introduce errors into the test results, and engine tests take so long that the results cannot meet real-time requirements. Therefore, it is necessary to develop technology with high accuracy and a fast response to predict the in-cylinder pressure of diesel engines. In this paper, the in-cylinder pressure values of a high-speed diesel engine under different conditions are used to train an extreme gradient boosting model, and the sparrow search algorithm, a swarm intelligence optimization algorithm, is introduced to optimize the hyper parameters of the model. The research results show that the extreme gradient boosting model combined with the sparrow search algorithm can predict the in-cylinder pressure under each verification condition with high accuracy, and the proportion of samples in the validation set whose prediction error is less than 10% is 94%. In the process of model optimization, it is found that, compared with the grid search method, the sparrow search algorithm has a stronger hyper parameter optimization ability, reducing the mean square error of the prediction model by 27.99%.


Introduction
As a stable and efficient power source, the diesel engine plays an important role in industry, agriculture and transportation. Since the advent of the world's first diesel engine, researchers have been committed to improving the performance of the diesel engine to meet ever more severe application conditions. The combustion condition in the cylinder is directly related to the power output and emission level of the diesel engine. In order to analyze and optimize the combustion process of the diesel engine, the most commonly used method is to measure the in-cylinder pressure. By analyzing the heat release rate according to the in-cylinder pressure, the variation characteristics of many parameters in the combustion process can be acquired. In the development and calibration stage of diesel engines, the cylinder pressure is a very valuable reference indicator, which is of great significance for improving power and economy, reducing noise and emissions, and reducing the failure probability of engines [1]. Frank Willems proposed that real-time closed-loop control of in-cylinder pressure is one of the effective methods to achieve efficient and clean combustion of diesel engines in the future [2], and that the control of combustion phase and heat release is the key to ensuring stable and efficient operation of engines. Marcus Klein et al. proposed four real-time estimation methods for the compression ratio based on the in-cylinder pressure trace, and used the estimation methods to evaluate simulation cycles and test cycles, which improved the stability of the variable compression ratio engine [3]. A.J. Torregosa et al. proposed and verified a method for diagnosing noise sources by extracting appropriate components from the in-cylinder pressure signal.
Appl. Sci. 2022, 12
In this paper, the in-cylinder pressure data of the diesel engine under different steady-state conditions were acquired through bench tests. The extreme gradient boosting model in ensemble learning was trained with the in-cylinder pressure data. Considering that the prediction model has many hyper parameters and the adjustment process is very complex, the sparrow search algorithm, a swarm intelligence optimization algorithm, was used to optimize the hyper parameters of the model in order to simplify hyper parameter adjustment and improve the prediction accuracy of the model.

Experiment and Data Acquisition
The test object of this study is a supercharged and intercooled high-speed diesel engine. The detailed engine specifications are summarized in Table 1. The schematic diagram of the experimental set-up is presented in Figure 1.
Figure 1. Schematic of experimental set-up: 1 fuel tank; 2 fuel rail; 3 pressure sensor; 4 fuel filter; 5 fuel consumption meter; 6 high-pressure pump; 7 electric motor; 8 PC and control unit; 9 air flow meter; 10 intercooler; 11 air filter; 12 dynamometer; 13 crankshaft; 14 piston; 15 cylinder pressure sensor; 16 charge amplifier; 17 combustion analyzer; 18 gas analyzer; 19 smoke meter; 20 PC and control unit.
In the steady-state tests of the diesel engine, it was necessary to control the variables in the tests. The intake air temperature was maintained at (25 ± 2) °C by the air conditioner, the air humidity was maintained at ~50%, and the intake air pressure was (101 ± 1) kPa. The exhaust pressure of the engine was maintained at (10 ± 0.5) kPa. The cooling mode of the engine was water cooling, and the cooling water temperature was maintained at (85 ± 5) °C. The fuel used in the tested engine was China VI 0# diesel. During the tests, the Kistler 6125c cylinder pressure sensor was used for cylinder pressure measurement.
The installation position of the cylinder pressure sensor was in the cylinder head of the first cylinder, connected to the charge amplifier. The pressure signal was amplified by the charge amplifier and transmitted to the combustion analyzer. At the same time, the Kistler angle scale was used to identify the TDC (Top Dead Center) and CA (crank angle) signals. The measuring range of the Kistler 6125c cylinder pressure sensor is 0~300 bar and the deviation is ±1%. In this study, the in-cylinder pressure was collected every 0.5° of crank angle over a crank angle range of −360° to 360°, so a total of 1441 samples were collected in a single engine cycle. The in-cylinder pressure versus crank angle was recorded over 100 engine cycles under each operating condition. The main instrumentation specifications used on the test bench are summarized in Table 2. The selected operating conditions are shown in Table 3. The in-cylinder pressure values under 30 operating conditions were collected.

Extreme Gradient Boosting
XGB (Extreme Gradient Boosting) trains multiple decision trees in series. Each decision tree learns from the previous decision tree, and the final prediction is generated by synthesizing the decision values of all weak learners. XGB expands the loss function with a second-order Taylor series and introduces a regular term to avoid overfitting of the model [25]. Figure 2 is the schematic diagram of the XGB model.
Each round of training in boosting adds a new function to the model. The objective function is shown in Equation (1):

Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}    (1)

where t is the number of training rounds, f_t(x_i) represents the t-th regression tree, \Omega(f_t) is the penalty term and \text{constant} is the constant term. Expanding the objective function with a second-order Taylor series gives Equation (2):

Obj^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant}    (2)

where g_i is the first derivative and h_i is the second derivative of the loss with respect to \hat{y}_i^{(t-1)}. The penalty term is defined as Equation (3):

\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2    (3)

where T is the number of leaf nodes and w_j represents the weight of the j-th leaf node. The objective function can be reduced to Equation (4):

Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2 \right] + \gamma T    (4)

where G_j and H_j are the sums of g_i and h_i over the samples assigned to the j-th leaf.
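To make Equations (1)–(4) concrete, the sketch below (an illustration, not the paper's implementation) applies the second-order boosting update with a squared-error loss and single-leaf "trees", for which minimizing Equation (4) gives the optimal leaf weight w* = −G/(H + λ):

```python
import numpy as np

def second_order_boost(y, rounds=50, lam=1.0, eta=0.3):
    """Gradient boosting with one-leaf trees and squared-error loss.

    For l(y, yhat) = (y - yhat)^2 / 2 the derivatives are
    g_i = yhat_i - y_i and h_i = 1, so the optimal leaf weight
    from Equation (4) is w* = -G / (H + lambda).
    """
    yhat = np.zeros_like(y, dtype=float)
    for _ in range(rounds):
        g = yhat - y                      # first derivatives
        h = np.ones_like(y)               # second derivatives
        w = -g.sum() / (h.sum() + lam)    # optimal leaf weight w*
        yhat = yhat + eta * w             # shrunk update (eta = learning rate)
    return yhat

y = np.array([1.0, 2.0, 3.0, 4.0])
pred = second_order_boost(y)
print(np.round(pred, 3))  # all predictions approach the mean 2.5
```

With a single leaf per round, the ensemble simply converges to the mean of the targets; a real XGB tree splits the samples into leaves and applies the same weight formula per leaf.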

Sparrow Search Algorithm
SSA (Sparrow Search Algorithm) is a new swarm intelligence optimization algorithm. Its design inspiration comes from the group foraging behavior of the sparrow population in nature. Individuals in the sparrow population adapt to the environment by constantly adjusting their distribution position, so as to obtain better food resources and avoid the attacks of predators [26]. The sparrow search algorithm has been shown to outperform many traditional swarm intelligence optimization algorithms in its ability to find the optimum and to avoid being trapped in local extremes [27,28]. The mathematical model of SSA is as follows. Assuming that the virtual sparrows are foraging, a population of n sparrows can be represented by matrix (5):

X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\ \vdots & \vdots & & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,d} \end{bmatrix}    (5)

where n signifies the number of all sparrows in the population and d describes the dimension of the decision variables. The fitness values of all sparrows can be expressed by Equation (6):

F_X = \begin{bmatrix} f([x_{1,1}, x_{1,2}, \ldots, x_{1,d}]) \\ f([x_{2,1}, x_{2,2}, \ldots, x_{2,d}]) \\ \vdots \\ f([x_{n,1}, x_{n,2}, \ldots, x_{n,d}]) \end{bmatrix}    (6)

Sparrow populations are divided into producers and scroungers. Producers have higher energy reserves and are responsible for searching for areas with more food, providing foraging areas and directions for scroungers. When individual sparrows detect predators, they sound an alarm signal. If the alarm value is higher than the safety value, the producers take the scroungers to a safe area for foraging. In the iterative process of the algorithm, the update rule of the producers' positions is shown in Equation (7):

X_{i,j}^{t+1} = \begin{cases} X_{i,j}^{t} \cdot \exp\left(\dfrac{-i}{\alpha \cdot M}\right), & R_2 < ST \\ X_{i,j}^{t} + Q \cdot L, & R_2 \geq ST \end{cases}    (7)

where t is the current number of iterations, X_{i,j} is the location information of the sparrows, α is a random number in the range (0, 1], M is the maximum number of iterations, Q is a random number which obeys a normal distribution, and L is a 1 × d matrix in which the elements are all 1.
ST and R_2 represent the safety value and alarm value, respectively. When R_2 < ST, there is no predator invasion, and the producers can carry out a wide range of search operations. When R_2 ≥ ST, it means that the individuals in the population have detected the predators, and all sparrows need to fly to a safe area immediately. Scroungers keep an eye on the producers. Once the producers find a better foraging area, the scroungers will immediately compete with them; if the scroungers win, they seize the resources from the producers instantly. The rule for updating the scroungers' locations is shown in Equation (8):

X_{i,j}^{t+1} = \begin{cases} Q \cdot \exp\left(\dfrac{X_{worst}^{t} - X_{i,j}^{t}}{i^2}\right), & i > n/2 \\ X_{p}^{t+1} + \left| X_{i,j}^{t} - X_{p}^{t+1} \right| \cdot A^{+} \cdot L, & \text{otherwise} \end{cases}    (8)

where X_p is the best position occupied by the current producers, X_{worst} is the global worst position, A describes a 1 × d vector in which the elements are randomly assigned 1 or −1, A^{+} = A^{T}\left(AA^{T}\right)^{-1}, and n is the total number of sparrows in the population. When i > n/2, it means that the i-th scrounger with low fitness did not get any food and needs to fly to other areas for foraging. The positions of the sparrows which are aware of the danger are updated as Equation (9):

X_{i,j}^{t+1} = \begin{cases} X_{best}^{t} + \beta \cdot \left| X_{i,j}^{t} - X_{best}^{t} \right|, & f_i > f_g \\ X_{i,j}^{t} + K \cdot \left( \dfrac{\left| X_{i,j}^{t} - X_{worst}^{t} \right|}{(f_i - f_w) + \varepsilon} \right), & f_i = f_g \end{cases}    (9)

where β signifies a normally distributed random value with a mean value of 0 and a variance of 1, ε is a small constant to avoid division by zero, and K ∈ [−1, 1] is also a random number. f_i is the fitness of the current individual; f_g and f_w represent the current global best and worst fitness, respectively. When f_i > f_g, the sparrow is at the margin of the population and vulnerable to predators. f_i = f_g indicates that the sparrows in the middle of the population are aware of the danger and need to move closer to other sparrows to avoid being caught by predators. Figure 3 represents the iterative flow chart of SSA.
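A compact sketch of the SSA update rules of Equations (7)–(9) is given below; the function name, bounds, population sizes and the sphere test function are illustrative choices, not the paper's set-up, and the simplifications (e.g. fitness values not re-evaluated mid-iteration) are noted in the comments:

```python
import numpy as np

rng = np.random.default_rng(0)

def ssa_minimize(f, dim=2, n=30, max_iter=50, n_prod=6, n_aware=3, st=0.8,
                 lo=-5.0, hi=5.0):
    """Minimal SSA sketch following the update rules of Equations (7)-(9)."""
    X = rng.uniform(lo, hi, size=(n, dim))        # population matrix, Eq. (5)
    best_x, best_f = None, np.inf
    eps = 1e-12
    for t in range(1, max_iter + 1):
        fit = np.array([f(x) for x in X])         # fitness vector, Eq. (6)
        order = np.argsort(fit)
        X, fit = X[order], fit[order]
        if fit[0] < best_f:                       # track the global best
            best_x, best_f = X[0].copy(), fit[0]
        worst, fg, fw = X[-1].copy(), fit[0], fit[-1]
        r2 = rng.random()                         # alarm value R2
        for i in range(n_prod):                   # producer update, Eq. (7)
            if r2 < st:
                alpha = rng.random() + eps
                X[i] = X[i] * np.exp(-(i + 1) / (alpha * max_iter))
            else:
                X[i] = X[i] + rng.normal()        # Q * L
        xp = X[0].copy()                          # best producer position
        for i in range(n_prod, n):                # scrounger update, Eq. (8)
            if i > n // 2:
                X[i] = rng.normal() * np.exp((worst - X[i]) / (i + 1) ** 2)
            else:
                A = rng.choice([-1.0, 1.0], size=dim)
                X[i] = xp + np.abs(X[i] - xp) * (A / (A @ A))  # A+ = A^T (A A^T)^-1
        # danger-aware update, Eq. (9); fitness values are from the start of
        # the iteration (a simplification of the full algorithm)
        for i in rng.choice(n, n_aware, replace=False):
            if fit[i] > fg:
                X[i] = X[0] + rng.normal() * np.abs(X[i] - X[0])
            else:                                 # f_i == f_g case
                k = rng.uniform(-1, 1)
                X[i] = X[i] + k * np.abs(X[i] - worst) / ((fit[i] - fw) + eps)
        X = np.clip(X, lo, hi)
    fit = np.array([f(x) for x in X])
    if fit.min() < best_f:
        best_x, best_f = X[np.argmin(fit)].copy(), fit.min()
    return best_x, best_f

sphere = lambda x: float(np.sum(x ** 2))
x_best, f_best = ssa_minimize(sphere)
```

Note that the sphere function's optimum at the origin flatters the multiplicative producer rule; the benchmark studies cited in [27,28] give a proper comparison against other algorithms.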

Model Establishment
The Python language was used in the process of model building, the development environment was PyCharm, and the Python libraries used mainly include scikit-learn, pandas, NumPy, Matplotlib, etc.

Input and Output Selection
Extreme gradient boosting belongs to supervised learning. During the training process of the model, the input features and the output label of the model need to be determined. In order to predict the in-cylinder pressure under specific operating conditions of the diesel engine, the in-cylinder pressure was selected as the output label of the models, and the excess air coefficient, speed, torque, power, fuel consumption and crank angle, which can represent the characteristics of the operating conditions, were chosen as the input features.

Split and Preprocessing of Datasets
Through the steady-state operating condition tests, the data under 30 operating conditions were acquired. Each condition contained 1441 samples of in-cylinder pressure. In this paper, the in-cylinder pressure values from 6 conditions were selected as the validation set to prove the predictive performance of the model. The validation operating conditions are represented in Table 4. In order to facilitate the description of these operating conditions in later sections, they are numbered 1 to 6, respectively. A total of 34,584 samples from the remaining 24 operating conditions were randomly divided in the ratio of 8:2, of which 80% of the samples were used as the training set to train the model and 20% were used as the test set. In order to eliminate the dimensional differences between different features and reduce the training cost of the model, it is necessary to preprocess the original data. The preprocessing method selected in this study is normalization, which renders each feature dimensionless and scales the values into the range of [0, 1]. The normalization method is shown in Equation (10):

\hat{x} = \frac{x - x_{min}}{x_{max} - x_{min}}    (10)

where x is the original data, x_{min} is the minimum value of the feature, x_{max} is the maximum value of the feature, and \hat{x} represents the data after normalization. The data description after preprocessing is shown in Table 5.
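As an illustration of Equation (10) (the feature values below are invented, not the paper's data), min–max normalization can be applied column-wise:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X into [0, 1] via Equation (10):
    x_hat = (x - x_min) / (x_max - x_min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# hypothetical samples with two features: engine speed (r/min) and torque (N*m)
X = np.array([[1200.0, 100.0],
              [1800.0, 250.0],
              [2400.0, 400.0]])
X_norm = min_max_normalize(X)
print(X_norm)  # rows scale to [0, 0], [0.5, 0.5], [1, 1]
```

The scaler parameters (x_min, x_max) should be computed on the training set only and reused for the test and validation sets to avoid information leakage.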

Evaluation Criteria of the Model
In statistics, various statistical metrics are used to evaluate the prediction performance of a model. This paper used four common metrics: Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Coefficient of Determination (R^2). The equations and performance criteria of these metrics are shown in Table 6.

Table 6. Description of evaluation metrics 1.

Metric | Equation | Performance Criteria
MSE | MSE = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2 | The smaller the MSE value, the higher the prediction accuracy of the model. The value range of MSE is [0, +∞).
RMSE | RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2} | The smaller the RMSE value, the higher the prediction accuracy of the model.
MAE | MAE = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right| | The smaller the MAE value, the higher the prediction accuracy of the model.
R^2 | R^2 = 1 - \frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} | The value range of R^2 is [0, 1]. The closer it is to 1, the stronger the model's ability to explain the predicted object; the closer it is to 0, the worse the fit of the model.

1 \hat{y}_i is the predicted value, y_i is the true value and \bar{y} is the average of the true values.
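The four metrics of Table 6 can be computed directly; the toy arrays below are illustrative, not the paper's data:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute the four evaluation metrics from Table 6."""
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
m = evaluate(y_true, y_pred)
print(m)  # MSE = 0.025, MAE = 0.15, R2 = 0.98
```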

Predictive Performance of the Initialized Model
The selection of hyper parameters has a significant influence on the predictive performance of machine learning models. However, there is currently no relevant theoretical support for hyper parameter selection, and the adjustment process of hyper parameters is usually extremely cumbersome. The main hyper parameters that need to be adjusted for the XGB model and their meanings are shown in Table 7. First, the predictive performance of the initialized XGB was analyzed. The five hyper parameters of the model, max_depth, n_estimators, eta, min_child_weight, and gamma, were set to 2, 100, 0.1, 3, and 0.1, respectively. Figure 4 shows the results of the regression analysis of the initialized model, in which the blue scatter points represent the prediction results for the training set, the orange scatter points represent the prediction results for the test set, and the black straight line represents the 45° line where the predicted values equal the actual values. The number of samples in the training set is 27,667 and the number of samples in the test set is 6917, according to the data set division ratio described in Section 2.3.2. It can be seen in Figure 4 that the prediction performance of the initialized model is poor, and the prediction results for samples with larger values are extremely inaccurate. In order to comprehensively evaluate the performance of the initialized model, the data set was randomly divided in the ratio of 8:2, a total of 100 training and testing processes were performed on the model, and the evaluation results of each metric were calculated and recorded; the results are shown in Figure 5.

Predictive Performance of the Optimized Model
The predictive performance of the initialized model was poor and insufficient for the purpose of in-cylinder pressure prediction, so the hyper parameters of the model needed to be optimized. The upper and lower bounds of the hyper parameters needed to be set before optimizing the prediction model with the SSA. The upper and lower bounds of the five hyper parameters were empirically set to (10, 1000, 0.3, 10, 0.3) and (1, 100, 0.01, 1, 0.01), respectively. The dimension of the hyper parameters to be optimized was five, and the sparrow population size was set to 100. The fitness value was the sum of the MSE of the training set and the test set; the fitness function is shown in Equation (11):

fitness = MSE_{train} + MSE_{test}    (11)

The optimization trajectory of SSA is shown in Figure 6. According to the optimization trajectory, the optimal fitness of the population continued to decrease as the number of iterations increased, and the calculation converged when the number of iterations reached 10, which implies that the optimal value was found. The minimal MSE was 0.05688, and the corresponding values of the five hyper parameters of the model were 8, 1000, 0.0688, 4.8015, and 0.01, respectively.
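The fitness function of Equation (11) can be sketched as follows; the toy predict function stands in for a trained XGB model, and all names and values here are illustrative:

```python
import numpy as np

def fitness(predict, X_train, y_train, X_test, y_test):
    """Equation (11): fitness = MSE on the training set + MSE on the test set."""
    mse = lambda y, yhat: float(np.mean((y - yhat) ** 2))
    return mse(y_train, predict(X_train)) + mse(y_test, predict(X_test))

# toy one-feature "model" standing in for a trained XGB: yhat = 2 * x
predict = lambda X: 2.0 * X[:, 0]
X_train, y_train = np.array([[1.0], [2.0]]), np.array([2.0, 4.0])
X_test,  y_test  = np.array([[3.0]]),        np.array([6.5])
fit_value = fitness(predict, X_train, y_train, X_test, y_test)
print(fit_value)  # 0.0 (train) + 0.25 (test) = 0.25
```

In the actual optimization loop, each candidate hyper parameter vector proposed by SSA would be decoded into model settings, the model retrained, and this fitness value minimized.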
Figure 7 shows the results of the regression analysis of SSA-XGB. Both the training set and test set in the figure overlap with the diagonal line and achieve a good fit, indicating that the predictive performance of the model was significantly improved after the optimization of the hyper parameters.
Grid search is one of the most basic hyper parameter optimization algorithms.
The basic principle is to adjust the parameters sequentially in steps within the specified parameter ranges, and to use the adjusted parameters to train the prediction model until the optimal hyper parameters are found. Compared with swarm intelligence optimization algorithms, the traditional grid search method takes more computation time and may not always find the extremum of the objective function. Setting the step sizes of the five hyper parameters to (1, 10, 0.001, 0.1, 0.001) and then using grid search to optimize the hyper parameters, the minimum MSE of the model was 0.08077. Table 8 presents the hyper parameters and MSE of the initialized XGB, grid search-XGB and SSA-XGB; the MSE of the SSA-XGB model was reduced by 27.99% compared with the grid search method.
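The grid search procedure can be sketched generically as an exhaustive Cartesian product; the grid values and the placeholder objective below are illustrative, not the paper's actual search space:

```python
from itertools import product

# illustrative grid; the paper steps each hyper parameter within its bounds
grid = {
    "max_depth": [2, 4, 6, 8],
    "n_estimators": [100, 400, 700, 1000],
    "eta": [0.01, 0.1, 0.3],
}

def objective(params):
    # placeholder standing in for "train XGB with params, return MSE";
    # by construction it is smallest at max_depth=8, n_estimators=1000, eta=0.1
    return ((params["max_depth"] - 8) ** 2
            + (params["n_estimators"] - 1000) ** 2 / 1e6
            + (params["eta"] - 0.1) ** 2)

best_params, best_mse = None, float("inf")
for values in product(*grid.values()):       # exhaustive Cartesian product
    params = dict(zip(grid.keys(), values))
    mse = objective(params)
    if mse < best_mse:
        best_params, best_mse = params, mse

print(best_params)
```

The cost is the product of the grid sizes per parameter (here 4 × 4 × 3 = 48 trainings), which is why fine step sizes over five hyper parameters quickly become expensive compared with SSA.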

Prediction Results of the Validation Set
The validation set contains in-cylinder pressure data from six different operating conditions, comprising 8646 samples. Using the optimized model to predict the validation set, the regression analysis is shown in Figure 9. As can be seen in Figure 9, most of the samples lie around the diagonal line, and only a few validation samples with larger values deviate, which means that the model obtained very accurate prediction results on the validation set. In order to represent the prediction results of the validation set more intuitively, the predicted values of each validation operating condition were plotted against the actual values; the results are shown in Figure 10. The horizontal axis is the crank angle and the vertical axis is the in-cylinder pressure value. The black curve represents the actual values acquired in the tests, and the remaining curves in different colors are the predicted values of the prediction model.
As can be seen in Figure 10, the predicted values for operating conditions 1, 2, and 4 are in good agreement with the actual values, and the predicted in-cylinder pressure values for the remaining operating conditions deviate from the actual values only in the peak region. Figure 11 shows the results of the error analysis for the validation conditions. The horizontal axis of the figure is the crank angle, the vertical axis is the error value, the black horizontal line indicates the 10% error line, and the colored dashes represent the specific errors between the predicted and actual values for all samples in the different validation operating conditions. From Figure 11a,b, it can be seen that the error between the predicted and actual values for all samples in validation conditions 1, 2, and 4 is less than 10%, and the prediction error in the range of 0~180° CA ATDC is relatively larger. There are more samples with prediction errors greater than 10% in conditions 5 and 6. After counting, the percentage of samples with prediction errors below 10% in the validation set is 94%.
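The 94% figure is the share of validation samples whose relative error is below 10%; with hypothetical arrays it can be computed as:

```python
import numpy as np

def pct_within_tolerance(y_true, y_pred, tol=0.10):
    """Percentage of samples whose relative prediction error is below tol."""
    rel_err = np.abs(y_pred - y_true) / np.abs(y_true)
    return 100.0 * np.mean(rel_err < tol)

# hypothetical in-cylinder pressures (bar): 4 of 5 predictions within 10%
y_true = np.array([50.0, 80.0, 120.0, 60.0, 90.0])
y_pred = np.array([52.0, 85.0, 140.0, 61.0, 88.0])
share = pct_within_tolerance(y_true, y_pred)
print(share)  # 80.0
```

In practice, relative error at crank angles where the pressure approaches ambient should be handled carefully, since small denominators inflate the ratio.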

Conclusions
In this study, we acquired the in-cylinder pressure of a high-speed diesel engine under different steady-state operating conditions. In order to predict the in-cylinder pressure, we introduced the extreme gradient boosting model of ensemble learning and used the sparrow search algorithm to optimize the hyper parameters of the prediction model. The research results show that the SSA-XGB model can accurately predict the in-cylinder pressure values. The percentage of samples with a prediction error of less than 10% in the validation set was 94%. XGB has many hyper parameters and the parameter adjustment process is complicated, but hyper parameter optimization must be performed in order to improve the model performance. In this paper, the optimization capability of SSA was demonstrated, and the MSE of the model was reduced by 27.99% after SSA optimization compared with the grid search method.

Data Availability Statement: The dataset generated and analyzed during the current study is available from the corresponding author on reasonable request.