Quantitative Analysis of Biodiesel Adulterants Using Raman Spectroscopy Combined with Synergy Interval Partial Least Squares (siPLS) Algorithms

Abstract: Biodiesel has emerged as an alternative to traditional fuels with the aim of reducing the impact on the environment. It is most commonly produced by the esterification of oleaginous seeds, animal fats, etc., with short-chain alcohols in an alkaline solution. This increases the oxygen content (from the fatty acids) and helps the fuel burn faster and more efficiently. The accurate quantification of biodiesel is of paramount importance to the fuel market because of the possibility of adulteration, which can result in economic losses, engine performance issues and environmental concerns related to corrosion. To this end, synergy interval partial least squares (siPLS) algorithms are combined with Raman spectroscopy in this work to quantify the biodiesel content. Different pretreatment methods are discussed for eliminating the large amount of redundant information in the original spectrum. The siPLS technique for extracting feature variables is then used to optimize the input variables after pretreatment, in order to enhance the predictive performance of the calibration model. Finally, the D1-MSC-siPLS calibration model is constructed from the preprocessed spectra, the selected input variables and the optimized model parameters. Compared with the feature variable selection methods of interval partial least squares (iPLS) and backward interval partial least squares (biPLS), the results show that the D1-MSC-siPLS calibration model is superior to the D1-MSC-biPLS and D1-MSC-iPLS models in the quantitative analysis of adulterated biodiesel. The D1-MSC-siPLS calibration model also demonstrates better predictive performance than the full-spectrum PLS model, with an optimal determination coefficient of prediction (R²P) of 0.9899; the mean relative error of prediction (MREP) decreased from 9.51% to 6.31% and the root-mean-squared error of prediction (RMSEP) decreased from 0.1912% (v/v) to 0.1367% (v/v). These results indicate that Raman spectroscopy combined with the D1-MSC-siPLS calibration model is a feasible method for the quantitative analysis of biodiesel in adulterated hybrid fuels.


Introduction
The increase in global greenhouse gas emissions and the scarcity of petroleum fuels have prompted a search for clean and sustainable alternative energy sources. One of these is biodiesel, which has become the main fuel for the transport industry in several countries [1,2]. Biodiesel is produced by the transesterification of short-chain alcohols (methanol or ethanol) with oil-containing seeds, animal fats or recycled cooking oils [3,4], catalyzed by acids, bases or enzymes. Acid and base catalysis can be homogeneous or multiphase [5]. Conventional homogeneous catalysis, especially base catalysis, greatly increases the reaction rate compared with slow acid catalysis, but its main disadvantage is the extra cost of catalyst removal. In contrast, multiphase catalysis simplifies the process and requires no additional recovery of the catalyst after the reaction, and it is a new development in production because the catalyst is reusable and the process is therefore more sustainable [6]. The development of biocatalysts opens up another avenue for the industrial production of biodiesel, but even the most successful lipases are still in the early stages of industrial production [7]. Biodiesel has several advantages over petroleum diesel: it contains no sulfur or aromatic compounds, has a higher cetane number and a higher flash point [8], and is biodegradable [9]. To promote the use of biodiesel, the European Union has implemented a policy requiring a 10% (v/v) proportion of biofuels in transportation fuels to replace fossil fuels [10]. In China, the national standard for "B5 diesel" requires 1-5% (v/v) BD100 biodiesel to be blended with diesel fuel, and biodiesel that meets national standards is exempt from excise tax and eligible for a 70% value-added tax refund [11]. Unfortunately, certain businesses adulterate diesel fuel by adding cheap vegetable oils [12], kerosene [13] or low-grade oils instead of biodiesel for illicit profit. One of the most common methods of adulteration is the addition of vegetable oils to diesel fuel, because of their excellent miscibility and similar molecular properties [14]. Research on the adulteration of biodiesel with higher-viscosity vegetable oils has shown that atomization and spraying are hindered at the fuel nozzles, which also causes decreases in engine power output and thermal efficiency, carbon deposition and other problems [15]. This not only leads to financial losses but also increases fuel consumption along with emissions of particulates and exhaust pollutants. It is therefore crucial to establish dependable analytical techniques for determining the quantity of biodiesel in diesel fuel blends that may contain vegetable oils.
Various analytical approaches have been reported for identifying and quantifying biodiesel in diesel-biodiesel blends, including chromatographic techniques [16][17][18], infrared (IR) spectroscopy [19,20], ultraviolet-visible (UV-Vis) spectroscopy [21,22], fluorescence spectroscopy [23], and nuclear magnetic resonance (NMR) spectroscopy [24][25][26]. Gas chromatography (GC) and high-performance liquid chromatography (HPLC) have great advantages in determining the biodiesel content in biodiesel/diesel blends. In particular, GC is the standard method for determining the extent of conversion of feedstock to biodiesel, with high accuracy and good reproducibility. However, the use of organic solvents is contrary to the concept of green chemistry, and the analyses are time-consuming and complex. Thus, there is a need for fast, efficient, non-destructive, and environmentally friendly techniques for the quantitative analysis of biodiesel. Chemometrics in combination with various spectroscopic techniques has been widely used for identifying and characterizing fuels and biofuels. For example, dos Santos et al. [27] utilized Fourier-transform infrared spectroscopy (FTIR) coupled with linear discriminant analysis (LDA) and partial least squares discriminant analysis (PLS-DA) to differentiate fuel blends. The findings demonstrate that the LDA and PLS-DA approaches are capable of distinguishing biodiesel samples ranging from 10% to 100%. However, moisture in the fuel blend affects quantification by NIR spectroscopy [28]. UV-Vis spectroscopy provides limited information for compound analysis and is typically used in conjunction with other test methods. Fluorescence spectroscopy is susceptible to many interfering factors, such as photodecomposition and oxygen quenching, and the relationship between fluorescence and compound structure is not well defined. NMR spectroscopy is widely used for the structural analysis of various compounds due to its high accuracy. Nevertheless, this technique also has the disadvantages of a high temperature consumption, susceptibility to interference from other substances, and operational difficulties. Therefore, novel and reliable techniques are needed to overcome the limitations of traditional methods in measuring the amount of biodiesel present in diesel fuel blends.
Raman spectroscopy displays sharp characteristic peaks corresponding to functional groups in various types of samples (liquids, solids, powders, etc.), and the positions and intensities of these peaks sensitively reflect the intrinsic structure of the analyzed substances and changes in it, enabling their identification and characterization [29]. Raman spectroscopy is a non-destructive, fast method that eliminates the need for sample preparation [30,31]. Coupled with diverse chemometric methods [32][33][34], this technique has been applied extensively for the identification of unknown substances in fields such as materials science, biology, pharmaceuticals, and food science [35][36][37][38]. Furthermore, it has also been employed for the analysis of different types of fuels, including gasoline, kerosene, and alcohol fuels [39][40][41]. For example, Dantas et al. [42] identified and quantified two biofuels (biodiesel and hydrotreated esters and fatty acids (HEFA)) and adulterants in petroleum diesel using multivariate curve resolution-alternating least squares (MCR-ALS) integrated with Raman spectroscopy. The quantitative model developed for the biofuels had an RMSEP value of 0.70%, and the RMSEPs for the concentrations of adulterants were up to 8%. Rugged and easy-to-use portable Raman spectrometers may offer more possibilities for the rapid and non-invasive identification of suspect samples, especially in field and real-time assessments.
There is an urgent need for technology to quantitatively analyze adulterated biodiesel, which has not been reported in the previous literature. In this work, portable Raman spectroscopy combined with a synergy interval partial least squares (siPLS) method, which can accurately select the variables associated with the biodiesel characteristic peaks, is proposed as a technique to quantify biodiesel. Firstly, Raman spectra were acquired from fuel samples with different levels of adulteration to investigate the characteristic Raman spectra of the different fuels. Then, various preprocessing methods were compared to establish suitable PLS calibration models. Additionally, several variable selection methods, including iPLS and biPLS, were compared to examine their effects on the calibration model and to validate the ability of different characteristic Raman spectral regions to quantify biodiesel. Finally, an optimal calibration model was constructed based on appropriate spectral preprocessing and model parameters. The aim of this work is to develop a rapid and non-destructive technique for the quantitative analysis of biodiesel using Raman spectroscopy combined with multivariate regression (PLS) and variable selection (siPLS). It provides a methodological reference for effective enforcement by quality inspection departments. In future work, we will extend the quantitative analysis of biodiesel to more adulterants and apply the tools we have developed to the quantitative analysis of a wide range of fuels, as well as to the detection of deeper compositional and performance indicators.

Experimental Samples
A total of 31 ternary mixtures of diesel fuel (0#, PetroChina, Xi'an, China), biodiesel (Jinan Xinwo Chemicals Co., Ltd., Jinan, China), and soybean oil (S817900-250 mL, Beijing Jinbailin Technology Co., Ltd., Beijing, China) were prepared. The range of the biodiesel volume fraction in diesel fuel, according to the petroleum and chemical industry standard NB/SH/T 0916-215, is 0.5-20%. Because vegetable oil is more viscous, the three components were mixed evenly by ultrasound and then allowed to equilibrate at room temperature before testing. The volume percentages of each component in the fuel samples are shown in Table 1.

Spectral Collection
Raman spectra were collected using a Qepro6500 laser Raman spectrometer (Ocean Optics, Delray Beach, FL, USA) equipped with a 785 nm semiconductor laser. The spectral acquisition range was 0-2000 cm⁻¹, and the laser power was 300 mW. Prior to each sample acquisition, background spectra were acquired using a clean, empty cuvette. The resolution was 4 cm⁻¹ and the ambient temperature was 20 °C during the acquisition. Measurements were repeated 5 times for each sample, and the mean spectra were used to build the model.

siPLS Method
Both the synergy interval partial least squares (siPLS) [43] and the backward interval partial least squares (biPLS) [44] methods are based on the interval partial least squares (iPLS) approach [45]. iPLS divides the entire spectral region into k equidistant intervals and constructs a PLS model on each interval. The sub-interval whose local model gives the minimum root-mean-square error of cross-validation (RMSECV) is taken as the characteristic waveband. The biPLS and siPLS approaches start from the same division of the full spectrum into k equal intervals but then operate on the spectral data differently. biPLS removes the interval with the lowest correlation among the k intervals and fits a PLS calibration model on the remaining (k-1) joint intervals. This process is repeated iteratively until only one interval remains. The RMSECV value of the PLS model at each step is used as the evaluation metric, with the minimum value corresponding to the best combination of intervals. In contrast, siPLS forms combinations of j (2 ≤ j ≤ k) sub-intervals from the k intervals and builds a PLS model on each combination. The combination of j intervals corresponding to the minimum RMSECV value represents the optimal spectral wavebands. The advantage of siPLS over iPLS and biPLS is that it improves the predictive power of the model by combining several higher-precision local models from multiple equidistant sub-intervals to find the most relevant regions of information.
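The interval-combination search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the MATLAB implementation used in this work: the function names (`pls1_fit`, `rmsecv_loo`, `sipls`), the NIPALS-style PLS1 routine, the use of leave-one-out cross-validation inside the search, and the synthetic data are all assumptions made for the example.

```python
from itertools import combinations
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit a single-response PLS model (NIPALS); return centering terms and coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)      # weight vector (assumes X and y are correlated)
        t = Xc @ w                     # scores
        tt = t @ t
        p = Xc.T @ t / tt              # X loadings
        q_a = yc @ t / tt              # y loading
        Xc = Xc - np.outer(t, p)       # deflate X
        yc = yc - q_a * t              # deflate y
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # regression vector in the original variables
    return x_mean, y_mean, b

def rmsecv_loo(X, y, n_lv):
    """Root-mean-square error of leave-one-out cross-validation."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        x_mean, y_mean, b = pls1_fit(X[keep], y[keep], n_lv)
        errors.append(y_mean + (X[i] - x_mean) @ b - y[i])
    return float(np.sqrt(np.mean(np.square(errors))))

def sipls(X, y, k, j, n_lv):
    """Try every combination of j out of k equal-width intervals; keep the lowest RMSECV."""
    intervals = np.array_split(np.arange(X.shape[1]), k)
    best_score, best_combo = np.inf, None
    for combo in combinations(range(k), j):
        cols = np.concatenate([intervals[i] for i in combo])
        score = rmsecv_loo(X[:, cols], y, n_lv)
        if score < best_score:
            best_score, best_combo = score, combo
    return best_score, best_combo

# Synthetic demonstration (hypothetical data): the response is the sum of two
# components, each visible only in one interval, so both intervals are needed.
rng = np.random.default_rng(0)
n, p = 30, 60
c1, c2 = rng.uniform(0, 10, n), rng.uniform(0, 10, n)
y = c1 + c2
X = rng.normal(0.0, 0.5, (n, p))
X[:, 12:18] += c1[:, None]   # interval 2 when k = 10
X[:, 42:48] += c2[:, None]   # interval 7 when k = 10
rmsecv, combo = sipls(X, y, k=10, j=2, n_lv=2)
print(combo, round(rmsecv, 3))
```

Because the search is exhaustive, the cost grows with the number of interval combinations, which is one reason j is usually kept small.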

Construction and Evaluation of Calibration Models
Spectral data preprocessing is an important step in establishing accurate quantitative analysis models, because it addresses interference factors such as noise, overlapping peaks, and baselines that affect the spectra. Firstly, spectral preprocessing was performed to eliminate noise from the raw spectra that would disturb model performance. The Kennard-Stone method (sample-set division based on the calculation of distances in the x-vector direction) was then employed to separate the preprocessed spectral data into a calibration set (70%) and a prediction set (30%) (Table 1). Additionally, variable selection methods helped to fully utilize the relevant variables that contributed to predictive performance. Therefore, characteristic intervals were selected using siPLS in the spectral range of 0-2000 cm⁻¹, and combinations of 2-4 intervals were used as input variables to construct all possible PLS quantitative analysis models. Finally, the interval combination whose PLS model gave the lowest RMSECV value was selected as the feature interval. The optimal number of latent variables (LVs) was determined from the best cross-validation (CV) result. The accuracy of the prediction results was judged by the agreement between the PLS model predictions and the reference values. Multiple metrics were used to evaluate the predictive performance of the model, including the mean relative error of prediction (MREP), the determination coefficient of prediction (R²P), the root-mean-squared error of prediction (RMSEP), and the residual predictive deviation (RPD). Leave-one-out cross-validation (LOO-CV) was employed to assess the stability of the models. All calculations were implemented using MATLAB (2016b). The modeling process is illustrated in Figure 1.
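As an illustration of the sample-set division step, the following Python sketch implements the usual form of the Kennard-Stone algorithm: start from the two most distant spectra, then repeatedly add the sample whose nearest already-selected neighbour is farthest away. The function name `kennard_stone`, the Euclidean distance, and the rounding of 70% of 31 samples to 22 are assumptions for the example; the actual assignment of samples is the one given in Table 1.

```python
import numpy as np

def kennard_stone(X, n_cal):
    """Return (calibration indices, remaining indices) chosen by Kennard-Stone."""
    # Pairwise Euclidean distances between all spectra (rows of X).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [m for m in range(len(X)) if m not in selected]
    while len(selected) < n_cal:
        # Distance from each candidate to its nearest already-selected sample.
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Tiny one-dimensional check: the extremes are picked first, then the point
# farthest from both of them.
toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
toy_cal, toy_rest = kennard_stone(toy, n_cal=3)
print(toy_cal, toy_rest)

# Hypothetical 31 x 50 data matrix split roughly 70/30.
rng = np.random.default_rng(1)
spectra = rng.normal(size=(31, 50))
cal_idx, pred_idx = kennard_stone(spectra, n_cal=round(0.7 * 31))
print(len(cal_idx), len(pred_idx))
```

The calibration set chosen this way spans the spectral space evenly, so prediction samples tend to fall inside the calibrated range.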

Raman Spectra of Biodiesel
Raman spectra of pure diesel, pure biodiesel and pure soybean oil in the 0-2000 cm⁻¹ range, baseline-corrected to reduce linear variation across the spectrum, are shown in Figure 2a. The characteristic peaks of the three oils were as follows. For diesel, the relevant vibrations were attributed to C1-C2 stretching, in-plane CH3 rocking, and C-O stretching (800-900 cm⁻¹), C-C stretching (1050-1150 cm⁻¹), =C-H bending (1245-1277 cm⁻¹), and CH2/CH3 bending vibrations (1400-1500 cm⁻¹) [2]. In comparison to diesel, biodiesel exhibited characteristic bands related to CH2 bending (1290-1320 cm⁻¹), C=C stretching (1600-1700 cm⁻¹), and C=O stretching vibrations (1700-1800 cm⁻¹) [30]. The characteristic bands of soybean oil were highly similar to those of biodiesel, including C=C bending (968 cm⁻¹), C-H bending (1245-1300 cm⁻¹), C=C stretching (1600-1700 cm⁻¹), and C=O stretching vibrations (1700-1800 cm⁻¹) [46]. This is consistent with the conclusions drawn in the relevant literature. The Raman spectra of the different fuels differed in their fingerprint region and range, enabling the compounds present in each sample to be identified. It can be seen from Figure 2b that the Raman spectra varied as the biodiesel content in the fuel increased: the Raman intensity changed between samples, but the Raman shifts of the spectral lines remained relatively constant. The gradual increase in the intensity of the characteristic C=O peak (1700-1800 cm⁻¹) in the samples shows that information about the biodiesel content can be obtained by analyzing the characteristic band regions of adulterated diesel.

Optimization of Spectral Preprocessing Methods
During the collection of Raman spectra, various factors such as the collection environment, sample background, and light scattering can introduce significant redundancies into the raw spectra. Therefore, preprocessing of the original Raman spectra is necessary before constructing the PLS calibration model. This study compared different spectral preprocessing methods, including normalization (Nor), standard normal variate (SNV), wavelet transform (WT) [47], first-order derivative (D1st), and multiplicative scatter correction (MSC) [30], as well as their different combinations, to investigate their effects on the PLS calibration model. The D1st preprocessing method effectively removes baseline drift and background noise, which increases peak resolution and improves the signal-to-noise ratio. MSC and SNV have similar purposes and are mainly used to eliminate the influence of physical factors, such as variations in the spectral range and solid particle size, on spectral data. Spectral analysis showed that the preprocessed spectra exhibited richer peaks, clearer waveforms and higher resolution. The model evaluation criteria were R²CV and RMSECV. As shown in Table 2, D1st was the optimal individual preprocessing method (R²CV = 0.9741, RMSECV = 0.1661% (v/v)). Compared to the original spectra (R²CV of 0.9738 and RMSECV of 0.1912% (v/v)), the PLS calibration model achieved significant improvements. The other preprocessing methods, i.e., Nor, SNV, WT, and MSC, did not enhance the R²CV and RMSECV of the model; instead, they led to a decrease in performance. However, a slightly better prediction performance was observed for the prediction dataset after applying the Nor and MSC methods, with R²P values increasing from 0.9868 to 0.9894 and 0.9896, respectively.
The three preprocessing methods D1st, MMS and MSC, which enhanced the model in different ways, were combined in various forms. It was finally determined that the spectral data obtained after D1-MSC preprocessing would be used as input variables for establishing the calibration model. Combined with the optimal number of LVs (5), the D1-MSC-PLS model achieved an R²CV of 0.9812 and an RMSECV of 0.1558% (v/v). When this model was used for prediction, the R²P was 0.9932, the RMSEP was 0.1413% (v/v), and the MREP was 7.10%. These performance indicators showed slight improvements compared to the original spectra (R²P of 0.9868, RMSEP of 0.1654% (v/v), and MREP of 9.51%) and the D1-PLS model (R²P of 0.9842, RMSEP of 0.1931% (v/v), and MREP of 11.07%), confirming D1-MSC as the effective spectral preprocessing method for constructing the PLS model. The stability of the model can be increased by optimizing the number of smoothing points in D1st using LOO-CV. The number of smoothing points is crucial for the D1st method. If it is too small, the denoising effect on the spectrum will be poor; if it is too large, valuable information may be eliminated, leading to signal distortion. R²CV and RMSECV were used as indicators to assess model predictive performance. Figure 3b shows that RMSECV first decreased and then increased as the number of smoothing points was increased. The minimum RMSECV (0.1661% (v/v)) and optimal R²CV (0.9741) were achieved when the number of smoothing points was 9. At that point, the PLS calibration model exhibited good predictive power. Additionally, the number of latent variables (LVs) of the model [43] was optimized. As shown in Figure 3a, the RMSECV initially decreased and then increased as the number of LVs grew. The minimum RMSECV was reached with 5 LVs. Going beyond 5 LVs increased the risk of overfitting. Therefore, setting the number of LVs to 5 was deemed reasonable.
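To make the preprocessing steps concrete, the sketch below gives minimal NumPy versions of SNV, MSC and a smoothed first derivative, applied in the D1-MSC order. It is a sketch under stated assumptions rather than the procedure used here: the derivative filter is not specified in the text, so a 9-point moving average followed by a finite difference stands in for it, and the function names and the synthetic spectra (a common band shape with a different gain and offset per sample) are hypothetical.

```python
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum (row) individually."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def msc(X, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    ref = X.mean(axis=0) if reference is None else reference
    out = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        slope, intercept = np.polyfit(ref, x, 1)   # fit x ~ slope * ref + intercept
        out[i] = (x - intercept) / slope
    return out

def smoothed_first_derivative(X, window=9):
    """Moving-average smoothing followed by a finite-difference first derivative.
    Assumption: stands in for the unspecified derivative filter with 9 smoothing points."""
    kernel = np.ones(window) / window
    smooth = np.apply_along_axis(lambda x: np.convolve(x, kernel, mode="same"), 1, X)
    return np.gradient(smooth, axis=1)

# Synthetic spectra: one band shape, different multiplicative gain and additive offset.
shift = np.linspace(0, 2000, 400)
base = np.exp(-((shift - 1450) / 30) ** 2) + 0.6 * np.exp(-((shift - 1740) / 25) ** 2)
gain = np.array([0.8, 1.0, 1.3])
offset = np.array([0.05, -0.02, 0.10])
X = gain[:, None] * base[None, :] + offset[:, None]

X_msc = msc(X)                       # gain and offset removed: all rows coincide
X_d1 = smoothed_first_derivative(X)  # offset removed, gain still present
X_d1_msc = msc(X_d1)                 # D1 followed by MSC
print(np.ptp(X_msc, axis=0).max())
```

In this idealized case MSC maps every spectrum onto the mean spectrum, while the derivative removes the constant offset but leaves the gain, which is why combining the two can be useful.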

PLS Calibration Model Based on siPLS Feature Variables Selection
The PLS calibration model based on preprocessed spectral data showed a slight improvement in prediction performance over the original spectra. However, there were still some irrelevant variables in the spectra, as well as multicollinearity among the variables, which not only increased the modelling time and complexity but also reduced the robustness and accuracy of the model [44]. Therefore, a model was constructed by extracting the variables related to the characteristic biodiesel peaks from the whole spectrum, as a way to improve its predictive performance [46]. In this study, siPLS was employed to divide the preprocessed spectral matrix (D1-MSC) into 10-20 equal intervals. Since the siPLS algorithm requires selecting combinations of spectral bands and the computation is demanding, the number of combined sub-intervals (j) is typically kept below 5. In this experiment, j was set to 2, 3, and 4. Table 3 presents the results of feature sub-interval selection by siPLS for different values of k and j. It can be observed that as the number of intervals decreases and the number of combined intervals rises, the RMSECV value decreases, indicating the selection of more effective spectral regions. Considering that the RMSEP value is the prediction error calculated from the test set data, the high RMSEP value of siPLS (2) indicates that the model tends to overfit under this parameter. However, this problem is overcome as the number of combined intervals increases, and the stability of the proposed method is improved. When k = 12 and j = 4, the corresponding R²CV was 0.9842, and the minimum RMSECV of 0.1329% (v/v) was achieved.
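The computational load mentioned above follows from simple counting: with k intervals and j of them combined, there are C(k, j) candidate interval sets, each requiring its own cross-validated PLS model. A short Python check for the ranges used here (k = 10-20, j = 2-4) is given below; the dictionary name `n_models` is only illustrative.

```python
from math import comb

# Number of candidate interval combinations C(k, j) for each setting.
n_models = {(k, j): comb(k, j) for k in range(10, 21) for j in (2, 3, 4)}

for j in (2, 3, 4):
    print(f"k = 12, j = {j}: {n_models[(12, j)]} combinations")
print(f"k = 20, j = 4: {n_models[(20, 4)]} combinations")
```

For the selected setting k = 12, j = 4 this gives 495 candidate models, before accounting for the different numbers of latent variables tried for each.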
Figure 4 illustrates the feature variable intervals selected by siPLS, which closely correspond to the characteristic peaks of C=C stretching and C-H bending in the original spectra. This confirms the feasibility of the siPLS feature variable selection method. The calculated error expresses the mathematical significance of the model, while the screening of characteristic peaks reflects the physicochemical significance of the study. The number of LVs was optimized simultaneously with siPLS variable selection. With 5 LVs, the RMSECV of the optimal combination model reached a stable state. The calibration model was then established using the optimal combination (k = 12, j = 4) and the best number of LVs.
The ability of the model to capture variations in the measurement data was evaluated using R², which represents the ability of the independent variables to explain the dependent variable; a higher R² indicates a better fit. The RMSE, which represents the predictive accuracy of the model, is the primary indicator for evaluating regression models, and a lower RMSE is desirable. RPD is a measure of the predictive power of the model, with higher values indicating better regression performance; models with RPD values greater than 6 are considered to have good regression performance. The variable selection results in Table 3 (siPLS (4)) all perform better on RMSECV than PLS (0.9812) after full-spectrum preprocessing, indicating the necessity of using variable selection as a preprocessing step in spectral detection, and the D1-MSC-siPLS model showed a significant improvement in predictive performance. The R²CV and R²P values were both above 0.98, and the RMSEP decreased from 0.1413% (v/v) to 0.1367% (v/v). The MREP decreased from 7.10% to 6.31%, while the RPDP was 9.95, indicating that the model was successfully established.
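The evaluation indices discussed above can be computed as in the following sketch. The formulas are the commonly used definitions and are stated here as assumptions, since this section does not spell them out: R² as one minus the ratio of residual to total sum of squares, MRE as the mean absolute relative error in percent, and RPD as the sample standard deviation (n - 1 denominator) of the reference values divided by the RMSE. The function names and the toy data are illustrative only and are not measurements from this work.

```python
import math
import statistics


def rmse(ref, pred):
    """Root-mean-squared error between reference and predicted values."""
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(ref, pred)) / len(ref))


def r_squared(ref, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_ref = statistics.fmean(ref)
    ss_res = sum((r - p) ** 2 for r, p in zip(ref, pred))
    ss_tot = sum((r - mean_ref) ** 2 for r in ref)
    return 1.0 - ss_res / ss_tot


def mre_percent(ref, pred):
    """Mean relative error in percent (assumed definition; reference values must be non-zero)."""
    return 100.0 * statistics.fmean(abs(p - r) / abs(r) for r, p in zip(ref, pred))


def rpd(ref, pred):
    """Ratio of the sample standard deviation of the reference values to the RMSE."""
    return statistics.stdev(ref) / rmse(ref, pred)


if __name__ == "__main__":
    # Toy reference and predicted concentrations in % (v/v), for illustration only.
    ref = [1.0, 2.0, 3.0, 4.0, 5.0]
    pred = [1.1, 1.9, 3.2, 3.9, 5.1]
    print("R2   =", round(r_squared(ref, pred), 4))
    print("RMSE =", round(rmse(ref, pred), 4))
    print("MRE% =", round(mre_percent(ref, pred), 2))
    print("RPD  =", round(rpd(ref, pred), 2))
```

Applying the same functions to the cross-validation predictions gives the CV indices, and applying them to the test-set predictions gives the P indices.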

Comparison of Different PLS Calibration Models
The performance of the model varied with the preprocessing method and the variable selection method. Two additional variable selection methods, iPLS and biPLS, were compared with siPLS to further validate the ability of the D1-MSC-siPLS model to quantify biodiesel. As shown in Table 4, the feature wavenumbers were reduced to 87, 261, and 378 by the iPLS, biPLS, and siPLS algorithms, respectively. These were used as input variables to the PLS model to build a quantitative biodiesel model. The variables selected by siPLS fall more closely on the characteristic peaks, reflecting the advantage of the algorithm. The cross-validated R²CV values of the three feature variable extraction methods were similar, ranging from 0.9738 to 0.9842. However, after variable extraction the RMSECV values of iPLS and biPLS decreased from 0.1912% (v/v) to 0.1692% (v/v) and 0.1811% (v/v), respectively, neither of which improved on the RMSECV of siPLS (0.1329% (v/v)). Additionally, both methods required more than 5 LVs, indicating a risk of overfitting. The RMSEP values also did not improve on that of D1-MSC-siPLS. Among the models, D1-MSC-biPLS (RMSEP = 0.1879% (v/v)) and D1-MSC-iPLS (RMSEP = 0.2263% (v/v)) showed marginal improvements in external validation results, but no improvement was observed in cross-validation results. The optimal preprocessed full-spectrum model, D1-MSC, had an R²P of 0.9932, which was higher than that of D1-MSC-siPLS (R²P = 0.9899). However, when R² is within a reasonable range, MRE is the primary consideration. Therefore, D1-MSC-siPLS was determined to be the optimal model, as it improved the R²CV of the calibration set from 0.9738 to 0.9842, decreased the RMSECV from 0.1912% (v/v) to 0.1329% (v/v), significantly reduced the number of variables from 1044 to 378, decreased the modeling time, and improved the model's predictive performance.
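The comparison above can be summarised with a few lines of arithmetic. The sketch below only re-uses values quoted in the text (number of variables, RMSECV and RMSEP in % (v/v), the baseline RMSECV of 0.1912% (v/v) and the 1044 full-spectrum variables); the dictionary layout and the helper names `relative_reduction` and `rank_by` are illustrative assumptions.

```python
# Values quoted in the text for the three interval-selection methods.
reported = {
    "iPLS":  {"n_vars": 87,  "rmsecv": 0.1692, "rmsep": 0.2263},
    "biPLS": {"n_vars": 261, "rmsecv": 0.1811, "rmsep": 0.1879},
    "siPLS": {"n_vars": 378, "rmsecv": 0.1329, "rmsep": 0.1367},
}
BASELINE_RMSECV = 0.1912  # RMSECV before variable selection, as quoted in the text
FULL_SPECTRUM_VARS = 1044  # number of variables in the full spectrum


def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before


def rank_by(metric):
    """Method names sorted from lowest to highest value of `metric`."""
    return sorted(reported, key=lambda name: reported[name][metric])


if __name__ == "__main__":
    for name in rank_by("rmsecv"):
        row = reported[name]
        print(
            f"{name:5s} vars kept: {row['n_vars']:4d} "
            f"({relative_reduction(FULL_SPECTRUM_VARS, row['n_vars']):.1f}% fewer), "
            f"RMSECV reduction: {relative_reduction(BASELINE_RMSECV, row['rmsecv']):.1f}%, "
            f"RMSEP: {row['rmsep']:.4f}"
        )
```

Ranking by either RMSECV or RMSEP places siPLS first, which is consistent with the choice of D1-MSC-siPLS as the optimal model.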

There are many different approaches in the literature to studying the partial or total replacement of biodiesel with different vegetable oils, which shows that this issue is important and should be approached from different perspectives. GC has been successfully applied, using UFGC, to the batch determination of the identity (concentration and feedstock) of biodiesel-diesel blends [48]. Liu et al. [2] predicted the biodiesel concentration in blended fuels by combining partial least squares (PLS) calibration with the C-H characteristic regions of Raman spectra. Quantitative and qualitative analyses of biodiesel using NMR spectroscopic methods were carried out by Doudin [24]. Although many studies have made significant contributions to the field, there is still a need for a rapid detection method for the quantification of biodiesel in adulterated fuels. Our work is a pioneering study that demonstrates the feasibility of Raman spectroscopy as a quantitative detection technique for biodiesel containing adulterants. In addition, Raman spectroscopy has the potential to be an alternative technique for use by government agencies or quality control laboratories due to its unique capabilities.

Conclusions
In this study, a promising model for improving the accuracy of the quantitative analysis of soybean oil-adulterated biodiesel, based on Raman spectroscopy and machine learning methods, was presented. The D1-MSC-siPLS calibration model was constructed after optimizing the input variables using the D1-MSC preprocessing method and the siPLS feature variable selection method. Compared with the iPLS and biPLS variable selection methods, the results show that the D1-MSC-siPLS calibration model is superior in the quantitative analysis of adulterated biodiesel. The cross-validated R²CV of the siPLS calibration set improved from 0.9738 to 0.9842, and the RMSECV decreased from 0.1912% (v/v) to 0.1329% (v/v). Moreover, the number of variables was significantly reduced from 1044 to 378, reducing the modeling time. The model's predictive performance also improved, with an R²P of 0.9899, the RMSEP decreasing from 0.1654% (v/v) to 0.1367% (v/v), and the MREP decreasing from 9.51% to 6.31%. These results demonstrate the feasibility of using Raman spectra combined with the PLS calibration model for the quantitative analysis of biodiesel in adulterated diesel/biodiesel blends.

Figure 1. Research flow chart of the work.

Figure 3. Predictive performance of PLS models with different parameters ((a): the effect of latent variables; (b): the effect of different smoothing points of D1st).

Figure 5 shows the correlation between the values predicted by the D1-MSC-siPLS calibration model and the standard reference values of the biodiesel samples; the two exhibit a strong linear relationship. The MREP, in particular, decreased significantly after variable selection. These results demonstrate that spectral preprocessing and appropriate variable selection strategies can significantly enhance the quantitative performance of the model. Machine learning methods could be faster and more effective for quantitative analyses of biofuels in adulterated diesel/biodiesel blends.

Figure. The prediction performance of the model proposed in this work for biodiesel adulterants.

Table 1. Volume percentages of each component in fuel samples.

Table 2. Comparison of the predictive performance of partial least squares (PLS) calibration models based on different spectral preprocessing methods.
Note: k denotes the number of decomposition layers; db3 denotes the wavelet function.

Table 3. Cross-validation results of PLS calibration models with different separation intervals and different combinations of interval numbers.

Table 4. Comparison of predictive performances of PLS calibration models.