Soft Sensor Modeling Method for the Marine Lysozyme Fermentation Process Based on ISOA-GPR Weighted Ensemble Learning

Due to the highly nonlinear, multi-stage, and time-varying characteristics of the marine lysozyme fermentation process, the global soft sensor models established using traditional single modeling methods cannot describe the dynamic characteristics of the entire fermentation process. Therefore, this study proposes a weighted ensemble learning soft sensor modeling method based on an improved seagull optimization algorithm (ISOA) and Gaussian process regression (GPR). First, an improved density peak clustering algorithm (ADPC) was used to divide the sample dataset into multiple local sample subsets. Second, an improved seagull optimization algorithm was used to optimize and transform the Gaussian process regression model, and a sub-prediction model was established. Finally, the fusion strategy was determined according to the connectivity between the test samples and local sample subsets. The proposed soft sensor model was applied to the prediction of key biochemical parameters of the marine lysozyme fermentation process. The simulation results show that the proposed soft sensor model can effectively predict the key biochemical parameters with relatively small prediction errors in the case of limited training data. According to the results, this model can be expanded to the soft sensor prediction applications in general nonlinear systems.


Introduction
Marine lysozyme is characterized by a low action temperature, a wide pH range, strong activity at room temperature, and only a moderate decrease in activity as temperature decreases [1,2]. It brings new energy and opportunities to industries such as cleaning, medicine, environmental protection, and food processing [3,4]. Therefore, it is necessary to dynamically regulate and optimize the marine lysozyme fermentation process in real time to maximize its production efficiency and product quality. However, the marine lysozyme fermentation process is a multivariate, time-varying, and complex nonlinear process. Due to practical process technology and cost considerations, key biochemical parameters that directly reflect the fermentation quality, such as cell concentration, substrate concentration, and relative enzyme activity, can only be roughly estimated through offline sampling and analysis. This not only hampers the operator's ability to make accurate decisions about the real-time reaction status, but also limits the implementation of the best control method. Therefore, a method that can accurately predict the key biochemical parameters of marine lysozyme fermentation in real time is urgently needed.
Soft sensor technology is an effective method to solve the above problems [5-9]. Hua et al. [10] proposed a new hybrid soft sensor model based on RF-IHHO-LSTM (random forest-improved Harris hawks optimization-long short-term memory) for the penicillin fermentation process. The simulation results show that the established soft sensor model has higher measurement accuracy and better performance, which can meet the practical requirements of the project. Wang et al. [11] constructed a multi-output least squares support vector machine (MLSSVM) regression model to solve the multi-input, multi-output problem for L-lysine. They also introduced the Improved Cuckoo Search (ICS) algorithm to optimize the essential parameters of the model. Finally, the hybrid ICS-MLSSVM soft sensor model was used to predict the key lysine parameters online. The simulation results show that the proposed regression model could accurately predict key biochemical parameters. Tokuyama et al. [12] developed a novel soft sensor model for estimating the concentrations of substrates, bacterial cells, and target products in commercial fermenters. The results suggest that the machine learning-based soft sensor model could represent a novel monitoring system for digital transformation in the biotechnology process field. Wang et al. [13] used an artificial neural network model to develop soft sensors to monitor the microbial lipid fermentation process of lipolytic yeast. The results show that this model offers the possibility of monitoring stem cell weight, glucose concentration, and lipids online with high accuracy. Sun et al. [14] developed a SOM-LSSVM (SOM, self-organizing feature map; LSSVM, least squares support vector machine) global modeling method for predicting the fermentation potency of chlortetracycline (CTC). Field experiments show that the method could obtain more accurate potency prediction values.
Sensors 2023, 23, 9119
Although the above modeling methods can meet the basic requirements for online prediction of key biochemical parameters, the question remains of how to further improve the prediction accuracy of the model. Due to the multiple operating conditions, strong nonlinearity, and high uncertainty of the biological fermentation process, sample data under different operating conditions often differ significantly. Therefore, traditional global soft sensor models constructed using a single soft sensor modeling method cannot accurately describe the dynamic characteristics of the entire fermentation process, resulting in low prediction accuracy and poor generalization ability. This makes it challenging for global soft sensor models to describe the multi-stage nature of the fermentation process, so they cannot guarantee prediction accuracy over the global scope. Some scholars suggest applying ensemble learning to this issue. Ensemble learning is an advanced machine learning method that combines different fusion strategies with base models to achieve more accurate predictions. The basic idea is that even if one weak base model produces an incorrect prediction, the other weak base models can still correct this error. Usually, ensemble learning has a stronger generalization capability than a single base model. Due to its flexible adaptability, ensemble learning has been successfully applied in various fields. Shen et al. [15] proposed a new method based on stochastic programming to realize a quality-related monitoring scheme for batch processing of multiple output modes through ensemble learning. Wang et al. [16] established a prediction model for rumen fermentation parameters in dairy cows using a stacked ensemble learning method and an in vitro technique. The comparison results show that the stacking ensemble learning method had better prediction results. Shen et al. [17] proposed a multivariate trajectory based on an ensemble punctual learning strategy to realize a batch quality prediction scheme for the problem of batch diversity. The literature indicates that modeling methods based on the ensemble learning framework can improve on the accuracy and generalization ability of single global soft sensor models. Ensemble learning can therefore be considered for the prediction of key biochemical parameters in the fermentation process.
Considering the excellent characteristics of ensemble learning and the nonlinear, multivariate, and multi-stage features of the marine lysozyme fermentation process, this paper proposes a weighted ensemble learning soft sensor modeling method based on an improved seagull optimization algorithm and Gaussian process regression [18-21] (ISOA-GPR). The structure of ISOA-GPR weighted ensemble learning is shown in Figure 1. Firstly, an improved density peak clustering algorithm (ADPC) is used to partition subsets of local samples and generate ISOA-GPR sub-prediction models. Then, the improved grayscale correlation algorithm is used to extract the centroid of each local sample subset. The centroid is weighted by information entropy to generate a weighted "centroid" that more accurately represents the features of the local sample subset. Finally, a fusion strategy based on the weighted improved grayscale correlation algorithm is proposed, which selects the sub-prediction models that are highly correlated with the test samples. The constructed soft sensor model is applied to the prediction of bacterium concentration, substrate concentration, and relative enzyme activity in marine lysozyme fermentation. The simulation results show that, compared with the single global soft sensor model based on ISOA-GPR, the prediction error and volatility of this model are smaller.

Data Subsets Construction Method
In response to the data distribution of marine lysozyme fermentation, this paper proposes an improved density peak clustering algorithm (ADPC) to partition the data subsets. This density-dependent classification method evaluates the similarity between data samples and can be used to cluster datasets of arbitrary shapes. Density peak clustering (DPC) is a typical density-based clustering method [22]. This algorithm requires each data point on which classification relies to have two eigenvalues: local density $\rho_i$ and relative distance $\delta_i$. It assumes that a cluster center has a higher local density than the surrounding data points and a larger relative distance from other cluster centers than other data points have.
Relative distance refers to the minimum distance between a sample point and any other point of higher density. For the sample set $R$, the local density $\rho_i$ of data point $x_i$ is computed from its distances to the other data points, where $x_i$ and $x_j$ represent the $i$th and $j$th data points, respectively, $dist_{ij}$ is the distance between $x_i$ and $x_j$, and $dist_c$ is the truncation distance.
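The two DPC quantities can be illustrated with a minimal sketch. It assumes the cutoff-kernel density of the classical DPC formulation (the number of other points closer than $dist_c$) and the common convention that the densest point receives its largest distance to any other point as its relative distance; the exact kernel used in this paper is not reproduced here, and the function names are illustrative.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length tuples."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def local_density(dist, dist_c):
    """Assumed cutoff kernel: count the other points closer than dist_c."""
    n = len(dist)
    return [sum(1 for j in range(n) if j != i and dist[i][j] < dist_c)
            for i in range(n)]

def relative_distance(dist, rho):
    """Minimum distance to any point of higher density.

    A point with no denser neighbour gets its maximum distance to any
    other point instead (usual DPC convention, assumed here).
    """
    n = len(dist)
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return delta

# Toy data: a tight group of three points and a separate pair.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
dist = pairwise_distances(points)
rho = local_density(dist, dist_c=0.5)
delta = relative_distance(dist, rho)
```

In this toy set the three grouped points tie for the highest density, so each of them falls back to the maximum-distance convention; real data rarely tie when a smooth kernel is used.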
Because a clustering center in the DPC algorithm has both a higher local density and a larger relative distance than other data points, the product of the two is used to select the clustering centers. If the cluster centers are close to one another, however, it is difficult to select them accurately on this basis alone. Therefore, this paper uses a logarithmic function to emphasize the differences between the clustering centers and the other data points. The process is as follows. Define a decision parameter $D_i$ that combines the local density, the relative distance, and a logarithmic function. Then, sort the decision parameters in descending order, denote them $\gamma_i$, and deduce the downward trend according to Equation (3), where $\gamma_i$ represents the current $\gamma$ value and $\gamma_{i-1}$ and $\gamma_{i+1}$ represent the $\gamma$ values at the preceding and subsequent positions, respectively. According to the distribution of the downward trend, select the data point with the largest downward trend and those before it as the clustering centers. The flowchart of the ADPC algorithm is shown in Figure 2.
Applying this method to actual marine lysozyme data samples, the descending distribution of the decision parameters and the downward trend of the parameters $\gamma_i$ are obtained, as shown in Figures 3 and 4, respectively. From Figure 3, it can be seen that the decision parameters of clustering centers differ significantly from those of non-center points. Except for the first few data points, the decision parameters of the other data points fluctuate little, so these points are not suitable for selection as clustering centers.
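The center-selection idea can be sketched as follows. This is only an illustration under stated assumptions: the decision value is taken as the plain product of local density and relative distance (the logarithmic form of $D_i$ used in this paper is not reproduced), and the downward trend is approximated by the drop between consecutive sorted values rather than by Equation (3). The function name and the toy numbers are hypothetical.

```python
def select_centers(rho, delta):
    """Pick cluster centers from the sorted decision values.

    Assumptions: gamma = rho * delta, and the 'downward trend' at a
    position is the drop to the next sorted value.
    """
    if len(rho) < 2:
        raise ValueError("need at least two points")
    gamma = [r * d for r, d in zip(rho, delta)]
    # Indices of the points, ordered by decreasing decision value.
    order = sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)
    sorted_gamma = [gamma[i] for i in order]
    drops = [sorted_gamma[k] - sorted_gamma[k + 1]
             for k in range(len(sorted_gamma) - 1)]
    # Position with the largest drop; it and everything before it are centers.
    cut = max(range(len(drops)), key=lambda k: drops[k])
    return order[:cut + 1], sorted_gamma

# Toy values: two points stand out in both density and relative distance.
rho = [10, 9, 3, 2, 2, 1]
delta = [6.0, 5.0, 0.5, 0.4, 0.3, 0.2]
centers, sorted_gamma = select_centers(rho, delta)
```

With these toy values the largest drop occurs after the second sorted point, so the first two points are returned as centers, which mirrors the observation that only the first few decision parameters stand apart from the rest.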

Sub-Prediction Model Construction
Considering the obvious nonlinear characteristics of the marine lysozyme fermentation process, the small data sample size, and the difficulty of offline extraction, this paper selects the Gaussian process regression method, which is good at predicting complex nonlinear outputs from small sample data, to establish a prediction model for the marine lysozyme fermentation process [23]. For Gaussian process regression models, the selection of hyperparameters has a significant impact on the accuracy of the prediction model. Traditional parameter selection methods rely on experience and trial and error, making it difficult to ensure regression accuracy and calculation speed. In order to generate a sub-prediction model with better performance, this paper utilizes the improved seagull optimization algorithm (ISOA) to optimize and adjust the hyperparameters online.
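To make the role of the hyperparameters concrete, the following is a minimal Gaussian process regression sketch in pure Python. It assumes a squared-exponential kernel with a length scale, a signal variance, and a noise variance; the kernel actually used in this paper is not stated in this section, and the names `length_scale`, `signal_var`, and `noise_var` are illustrative. These three values are the kind of quantities an optimizer such as ISOA would tune.

```python
import math

def rbf_kernel(a, b, length_scale, signal_var):
    """Squared-exponential covariance between two input vectors."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
    return signal_var * math.exp(-0.5 * sq_dist / length_scale ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def backward_sub(L, z):
    """Solve L^T x = z for lower-triangular L."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gpr_predict(X, y, x_new, length_scale=1.0, signal_var=1.0, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], length_scale, signal_var)
          + (noise_var if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, y))      # (K + noise I)^-1 y
    k_star = [rbf_kernel(x_new, xi, length_scale, signal_var) for xi in X]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = forward_sub(L, k_star)
    variance = signal_var - sum(t * t for t in v)
    return mean, variance

# Toy one-dimensional data; predicting at a training input with tiny noise
# should nearly reproduce the training target.
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [math.sin(x[0]) for x in X]
mean, variance = gpr_predict(X, y, (1.0,))
```

Changing `length_scale` or `noise_var` visibly changes both the mean and the predictive variance between the training inputs, which is why hand-tuning these values by trial and error is unreliable.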

Improved Seagull Optimization Algorithm
The seagull optimization algorithm (SOA) is an intelligent algorithm that simulates the behavior of seagull flocks in nature. This algorithm solves the spatial iterative optimization problem by simulating the seasonally varying long-range migratory behavior and spiral attack behavior of seagulls [24]. Compared with other optimization algorithms, the algorithmic structure of the SOA method is simpler, more stable, and more adaptable. It has only one modifiable parameter in the actual optimization process.
In traditional seagull optimization algorithms, the weight parameter decreases linearly as the number of iterations increases. Although the algorithm executes quickly, each iteration can easily reduce population diversity, which tends to cause weak global search ability in the early stage of the algorithm and poor local exploitation ability in the late stage. So, this paper proposes a nonlinear updating strategy for the weight parameter, in which $t$ is the current number of iterations, $Max_{iteration}$ is the maximum number of iterations, and $f_c$ is a constant whose initial value is set to 2. The improved weight parameter first decreases rapidly as the number of iterations increases and then decreases gradually and slowly. In the early iterations of the improved seagull optimization algorithm, the weight parameter decreases rapidly to maintain population diversity, which also improves the global search capability. In the later stages of execution, the weight parameter decreases gradually, which improves the local search ability and also ensures that the algorithm is not easily trapped in a local optimum. Therefore, using the improved seagull optimization algorithm to optimize the hyperparameters is expected to yield a more accurate soft sensor model. The pseudocode of the ISOA algorithm is given in Algorithm 1.
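The difference between a linear and a nonlinear weight schedule can be sketched as follows. The linear form is the schedule of the original SOA, in which the weight falls from $f_c$ to 0 in equal steps. The nonlinear form below is only a hypothetical stand-in (a quadratic decay) chosen because it decreases rapidly at first and slowly later; it is not the expression proposed in this paper, and both function names are illustrative.

```python
def linear_weight(t, max_iteration, f_c=2.0):
    """Original SOA schedule: linear decrease from f_c to 0."""
    return f_c - t * (f_c / max_iteration)

def nonlinear_weight(t, max_iteration, f_c=2.0):
    """Hypothetical nonlinear schedule (assumption, not the paper's formula):
    quadratic decay that drops fast early and flattens out late."""
    return f_c * (1.0 - t / max_iteration) ** 2

max_iteration = 100
linear = [linear_weight(t, max_iteration) for t in range(max_iteration + 1)]
nonlinear = [nonlinear_weight(t, max_iteration) for t in range(max_iteration + 1)]

# Size of the decrease over the first and the last ten iterations.
early_drop = nonlinear[0] - nonlinear[10]
late_drop = nonlinear[90] - nonlinear[100]
```

Both schedules start at $f_c$ and end at 0, but the nonlinear one spends most of the run at small weight values, which corresponds to the longer local-search phase described above.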

Sub-Prediction Model Selection and Fusion Strategy
The fusion strategy is an important component of ensemble learning, and a good fusion strategy is the key to realizing its performance advantage. This paper adopts a fusion strategy based on improved grayscale correlation to determine the weights of the sub-prediction models. The strategy finds the weighted "centroid" of every local sample subset, which best represents the features of the entire data subset. The better sub-prediction models are then selected based on the correlation between the test sample and each weighted "centroid". The correlation coefficient between the test samples and the weighted "centroid" is computed with the improved grayscale correlation algorithm because it can more accurately reflect the fluctuation between the marine lysozyme fermentation data sequences.
Given a local sample subset of the marine lysozyme fermentation process data $r = \{x_i;\ i = 1, 2, \ldots, n\}$, where $x_i \in R^d$, $n$ is the number of samples in the local sample subset, and $d$ is the dimensionality of the feature variables, let the reference sequence be $x_0 = \{x_0(1), x_0(2), \ldots, x_0(d)\}$ and calculate the grayscale correlation coefficient between it and each comparison sequence $x_i(k)$, where $\rho$ indicates the resolution coefficient, which is taken as 0.5. The correlation between the reference and comparison sequences is then calculated from these coefficients. Let each sample of the local sample subset in turn be the reference sequence and the remaining samples of that subset be the comparison sequences; this generates the sample correlation matrix. The sample with the strongest correlation with all the data in its local sample subset is picked as the initial center of mass of that subset, and its correlation coefficients with the other samples in the subset form the correlation coefficient matrix.
In this paper, information entropy is used to characterize, in a weighted manner, the degree of variation of each feature variable in the correlation coefficient matrix. This process assigns objective weights to the feature variables and generates a weighted "centroid" that is more representative of the information in the local sample subset. In general, the lower a feature variable's information entropy, the larger its degree of variation and the higher its weight. Conversely, when the information entropy increases, the relevance of the feature variable decreases, and its weight decreases. To this end, the characteristic weight of the $j$th characteristic variable of the $i$th sample is calculated first, followed by the entropy value of the $j$th characteristic variable, from which the weight $w_j$ of each characteristic variable in the correlation coefficient matrix is obtained.
Assume that the improved density peak clustering method produces a total of $m$ local sample subsets. The initial center of mass of each of the $m$ subsets, denoted $Z^*$, is weighted to obtain the weighted "centroid" $Z_m$ ($Z_m = w_j Z^*$). Then, the correlation set $\omega = [\omega_1, \omega_2, \omega_3, \ldots, \omega_m]$ is obtained by setting the fermentation test sample $x^*$ as the reference sequence and the weighted "centroids" of the $m$ local sample subsets as the comparison sequences. A critical correlation coefficient $\omega^*$ is chosen, which means that the ensemble retains the ISOA-GPR sub-prediction models whose correlation coefficients are greater than or equal to $\omega^*$. The corresponding sub-prediction results are $y_{pre} = \{y_{pre1}, y_{pre2}, y_{pre3}, \ldots, y_{pre\eta}\}$, $\eta \in [1, m]$, and the final prediction result of the grayscale correlation weighted ensemble learning is obtained by combining them according to their correlation coefficients.

Modeling Process
The modeling flowchart of the ISOA-GPR weighted ensemble learning is shown in Figure 5. To better illustrate the process of building the soft sensor model in this paper, the modeling process is described as follows.
Step 1: Obtain data on the marine lysozyme fermentation process through experiments, including major environmental parameters and key biochemical parameters (bacterium concentration, substrate concentration, relative enzyme activity). The improved density peak clustering algorithm is utilized to divide the local sample subsets ($R = \{r_1, r_2, \ldots, r_m\}$) as well as to calculate the weighted "centroid" $Z_m$ ($Z_m = w_j Z^*$) for each local sample subset.
Step 2: Calculate the degree of correlation between the various environmental parameters and the key biochemical parameters. Select environmental parameters with correlations greater than 0.7 as auxiliary variables to build the ISOA-GPR sub-prediction models.
Step 3: Input the test sample ($x^*$) and calculate its grayscale correlation coefficient with each weighted "centroid".

Simulation Results and Analysis
In order to validate the effectiveness of the proposed soft sensor modeling method, this paper carries out simulations on data from the marine lysozyme fermentation process. Taking the marine lysozyme fermentation process as the object, the culture strain was S-12-86, and the fermenter model was A103-500L. The Yellow Sea Fisheries Research Institute of the Chinese Academy of Fisheries Sciences provided the marine lysozyme fermentation method, and the Jiangsu University fermentation control system platform provided the marine lysozyme fermentation data.
In this paper, bacterium concentration (X), substrate concentration (S), and relative enzyme activity (P) were taken as the key biochemical parameters in marine lysozyme fermentation. An improved grayscale correlation algorithm was used to filter auxiliary variables, and data were extracted from 15 fermentation batches. The first 12 batches, a total of 720 data points, were used as training samples, and the last 3 batches, a total of 180 data points, were used as test samples. These measurements were used for training and simulation with a single global ISOA-GPR model, an ISOA-BP weighted ensemble learning soft sensor model, and an ISOA-GPR weighted ensemble learning soft sensor model. The simulation results are depicted in Figures 6-11. To show that the ISOA-GPR weighted ensemble learning soft sensor model performs better, the root mean square error (RMSE) and the maximum absolute error (MAE) are used to compare the prediction performance of the three models. The results are displayed in Table 1.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y(i) - \hat{y}(i)\right)^{2}}, \qquad \mathrm{MAE} = \max_{1 \le i \le n}\left|y(i) - \hat{y}(i)\right|$$
where $n$ is the number of tested samples, $y(i)$ represents the actual values of the key biochemical parameters (bacterial concentration, substrate concentration, and relative enzyme activity) for the tested samples, and $\hat{y}(i)$ represents the predicted values of the key biochemical parameters (bacterial concentration, substrate concentration, and relative enzyme activity) for the tested samples.
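The two error measures used for the comparison, the root mean square error and the maximum absolute error, can be computed as in the short sketch below. The function names and the toy values are illustrative and are not taken from the experimental data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error over paired actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def max_abs_error(y_true, y_pred):
    """Largest absolute deviation between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return max(abs(a - b) for a, b in zip(y_true, y_pred))

# Toy values only: one prediction is off by 2, the others are exact.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
error_rmse = rmse(actual, predicted)
error_max = max_abs_error(actual, predicted)
```

The root mean square error summarizes the overall deviation across all test samples, while the maximum absolute error exposes the single worst prediction, so the two together describe both accuracy and volatility.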
This paper established two sets of comparison experiments: the ISOA-GPR single global model versus ISOA-GPR weighted ensemble learning, and ISOA-GPR weighted ensemble learning versus ISOA-BP weighted ensemble learning. The predicted curves of the key biochemical parameters (bacterium concentration, substrate concentration, and relative enzyme activity) for the marine lysozyme fermentation process were derived using each of the three models, as shown in Figures 6, 8 and 10.
From the analysis of the above figures and table, the following can be seen.
(1) It can be concluded from Figure 6 that all three models can predict the bacterium concentration well, and that the trends of the predicted curves and the actual data are basically the same. It can be seen from Figure 7 that, except for some individual data points (e.g., the 15th data point), the ISOA-GPR weighted ensemble learning model has ...
... at the theoretical level. This not only enables online prediction of the key biochemical parameters in the marine lysozyme fermentation process, but also effectively improves the online prediction and tracking ability, solving the problem of poor prediction accuracy of traditional single global soft sensor models. At the same time, it provides new solutions for online prediction in other complex nonlinear industrial processes. The conclusions of this paper were obtained under ideal experimental conditions. However, in the actual fermentation process, fermentation conditions are prone to sudden changes, and minor differences in fermentation conditions will affect the predicted results. Therefore, the next stage of research is to solve the problem of how to apply this method to practical complex fermentation processes.

Figure 1. The structure of ISOA-GPR weighted ensemble learning for the marine lysozyme fermentation process.

Figure 4. The tendency for decision parameters $\gamma_i$ to decrease.

In order to clearly compare the prediction accuracy, the absolute error curves of the three predicted key biochemical parameters (bacterial concentration, substrate concentration, and relative enzyme activity) are shown in Figures 7, 9 and 11, respectively.

Figure 7. Error variation curve of bacterium concentration.

Figure 9. Error variation curve of substrate concentration.

Figure 10. Predicted curve of relative enzyme activity.

Figure 11. Error variation curve of relative enzyme activity.

Table 1. Comparison of the errors of the two modeling methods.