1. Introduction
With an increasing demand for vegetable oils, both in the food industry and for the production of biofuels, there is a call for an effective and environmentally friendly method to remove naturally occurring phospholipids, known as “gums”, during oil refinement [
1]. Degumming has traditionally been carried out using water followed by a chemical refining step, such as acid degumming, but is today generally conducted using enzymatic methods, which offer both higher oil yield and lower environmental impact [
2]. Degumming is an essential step in the refinement process of vegetable oils, as it serves to reduce the content of phospholipids. Enzymatic degumming is well established, and different types of phospholipases are commercially available.
Figure 1 shows an overview of the action of different types of phospholipases. The use of phospholipase type A (PLA) is an efficient way to reduce the content of phospholipids by hydrolyzation into lyso-phospholipids and free fatty acids (FFA). This enzymatic action results in higher oil yield, both due to higher FFA content in the oil and due to less entrapment of glycerides in the gum phase [
2,
3,
4].
In industry, enzymatic degumming is typically carried out in continuous stirred compartmentalized tank reactors with a holding time of 4–6 h depending on the oil origin and initial phospholipid content. Typically, the process is run according to a recipe with a fixed holding time and monitored by taking out grab samples for laboratory analysis [
2]. The industry standard for analysis is determination of FFA content by titration [
5], which is time-consuming, labor intensive and requires use of chemicals. In addition, the reproducibility of the method is questionable [
5].
Near-infrared spectroscopy (NIRS) is used in multiple production processes to monitor different quality attributes in the vegetable oil processing industry. These include dry matter and titratable acidity [
6], yield [
7] triglycerides [
8] and free fatty acid composition [
9]. NIRS has several advantages in comparison to the chemical method: samples can be measured in few seconds, it is non-destructive and no sample preparation or pre-treatment is required. NIRS is an interesting candidate method for real-time monitoring of enzymatic degumming, and has previously been used to monitor FFA in biodiesel production [
10,
11] and to monitor transesterification reactions [
12,
13,
14]. NIRS is especially advantageous due to its high speed and adaptability to on-line monitoring [
15]. In addition, the combination of NIRS and multivariate analysis enables monitoring and control of several quality attributes from only one spectroscopic sensor.
Replacement of the current grab sample strategy with real-time NIRS monitoring during the enzymatic degumming process will greatly facilitate automatic optimization of the process. Acquisition of NIR spectra from a bioreactor can be either on-line or in-line, and depending on the process equipment and the type of process, each mode of measurement has its own advantages [
15]. Traditional in-line fiber optic immersion probes may have limited suitability for the viscous oil matrix involved in the enzymatic degumming process. Hence, the on-line solution proposed in this study consists of a closed bypass loop with a flow cell, where the process flow goes through and returns to the reactor, giving several benefits in comparison to a fiber optic probe mounted directly in the reactor. An on-line bypass loop is very beneficial in mitigating formation of air in the reactor due to high stirring rates and allows for moving the viscous matrix through a sampling window. On-line NIRS enables the control of enzyme dosage and holding time by feedback PAT (Process Analytical Technology) control [
16], ensuring that the process is always in its optimal state, which is beneficial for economic, production time/capacity and environmental/resource expenditure reasons. The holding time and required amount of enzyme is highly dependent on the state of the raw material, i.e., the initial phosphorus content of the oil, the enzymatic performance in the given environment and the current state of the process [
2]. NIRS can facilitate this real-time and on-line monitoring of the current reaction state, leading to full control and optimization of the process.
Modern NIR instrumentation enables very fast acquisition of spectral scans during on-line monitoring. For interferometer-based instruments, where a spectrum is obtained by Fourier Transformation (FT) of the measured interferogram, a scan can typically be obtained in a few seconds. If several scans are averaged to reduce the measurement noise, the spectral noise can be approximated as
, where
N is the number of scans. Thus, to reduce the noise by 50%, the number of scans must be increased fourfold. The signal to noise ratio (SNR), or sensitivity, can be approximated by
. However, SNR only improves until the point where instrumental drift or other minor factors will start to dominate the random noise [
17,
18]. In addition, the duration of measurement will at some point start to influence the apparent process dynamics measured. In practice, measurement duration will represent a compromise between process dynamics and measurement uncertainty [
16]. To achieve an acceptable level of noise (measurement uncertainty) during NIRS analysis, it is common to collect and average a reasonable number of scans, causing a single measurement campaign to last several minutes. In case of a fast-changing chemical process, i.e., a process with a high reaction rate such as industrial enzymatic degumming, the average of these scans may lead to misinterpretation of the current state of the process, since the process may have changed considerably during acquisition from the first to last scan. The resulting average spectra simply does not represent the process state as it was at the timepoint of the last scan but rather the “chemical average” of the process kinetics as it happened during the whole data acquisition period. This difference between the final produced averaged spectra and the change in process presents an obvious problem when trying to regulate the process to an optimal setpoint. This paper aims at investigating the effect of the compromise between process dynamics and measurement uncertainty by varying the spectral averaging strategy to be adapted to the process dynamics.
As an alternative to acquiring NIRS measurements by averaging a fixed number of scans, a method where the instrument is set to generate scans in a free-run mode is proposed. In such a setup, there is no traditional beginning or end of a measurement, but just an endless stream of individual spectral scans. By applying a window function in the time-domain on this stream, it is possible to control how spectra are collated (averaged) prior to subsequent data analysis. By proper weighing of the prior scans in the window, not only will spectral noise (measurement uncertainty) be reduced as described above, but the changing reaction kinetics of the process can be considered as well. The way prior samples are weighted in the window is controlled by the width of the window. In the present study, we will examine the effect of varying the window width in the range of 0 to 300 s in intervals of 30 s.
In addition, the effect and performance of each window width based on PLS calibration models of FFA content is evaluated. This includes evaluation of different spectral preprocessing techniques and application of two different variable selection algorithms to obtain the most robust calibration model. The model performance is evaluated based on RMSECV (Root Mean Square Error of Cross Validation) and number of latent variables included and is illustrated using response surface plots.
2. Materials and Methods
Two batches of crude oil originating from the United States were used in the experiment. The free fatty acid content was determined at arrival at the laboratory by a slightly modified version of the AOCS Ca 5a-40 official method [
5] working with a smaller sample volume (1.0 g). Oil A was a soybean oil with a FFA content of 0.57%. Oil B was a rapeseed oil with a FFA content of 1.23%. The oils were stored at 5 °C until onset of the experiment. The phospholipase A1 type enzyme (Quara
® LowP, Novozymes A/S, Denmark) was used as the reaction enzyme. All other chemicals and reagents used in this study were purchased from Sigma Aldrich (Düsselforf, Germany).
2.1. Enzymatic Degumming
The two batches of crude oil were enzymatically degummed using a laboratory scale batch system consisting of a 250 mL blue cap flask as the reactor and a temperature-controlled water bath with magnetic stirring (Model T100, Grant Instruments Ltd., Cambridge, UK).
A total of 200 g of crude oil was prepared in the blue cap flask and heated to 70 °C at 500 rpm magnetic stirring in a water bath. The heated oil was treated with 650 ppm of citric acid solution (50% w/w, citric acid reagent grade ≥ 99.5%) and mixed in an ultrasonic bath for 5 min at 60 °C (Branson 3510 Ultrasonic Cleaner, Marshall Scientific, Hampton, NH, USA). Upon 15 min of incubation at 70 °C, 3% of Milli-Q® water was added, followed by the addition of 30 ppm of the phospholipase A1 enzyme Quara® LowP. The mixture was ultrasonicated for another 5 min at 60 °C, and the reactor was placed in the water bath at 70 °C for reaction. The reaction was monitored for three hours.
2.2. NIRS Measurements
A sample loop consisting of a flow cell and a peristaltic pump (Minipuls 3, Gilson Inc, Middleton, WI, USA) as shown in
Figure 2A was used. During degumming, the oil was pumped at 48 rpm in silicone tubing (I.D. 3.18 mm), through the flow cell, equal to a flow rate of approximately 28 mL/min.
Spectroscopic data was collected with a Quant FT-NIR (ABB MB3600 FT-NIR) Fourier Transform spectrometer with an InGaAs detector (Q-interline, Tølløse, Denmark) using the HORIZON MB Pro software (ABB, Zurich, Switzerland). A 3D-printed flow cell equipped with a Teflon (PTFE) 7 mm OD/6 mm ID tube was used as sample cell. Teflon was chosen as it is transparent to near-infrared radiation and has excellent performance with the viscous matrix. The light path length was fixed at 2 mm. The flow cell was equipped with two collimators (model F240SMA-1550, Thorlabs, Newton, NJ, USA) connected to the instrument by 3 m low-OH process grade fiber optical cables (Thorlabs, Newton, NJ, USA). A graphical representation of the flow cell is shown in
Figure 2B,C.
The spectrometer was set to record scans in the range from 11,000 cm−1 to 3900 cm−1 (≈910–2775 nm), with a resolution of 8 cm−1 and a digitalization of 4 cm−1, resulting in 1584 data points per scan. A background spectrum of the empty Teflon tube was obtained before the enzymatic degumming process was started.
The software was set to obtain one spectrum as frequent as possible, resulting in a scan every 2–8 s due to software limitations, resulting in collection overhead. In total 3979 and 3913 spectra were obtained from batch A and B, respectively.
2.3. Reference Sampling
Reference samples were grab-sampled every fifth minute in the first hour and every tenth minute thereafter throughout the experiment. Approximately 2 mL was transferred directly from the sample loop silicone tube to a preheated Eppendorf tube. The enzyme was inactivated by heating at 99 °C for a minimum of 30 min using an Eppendorf Thermomixer (Eppendorf, Hamburg, Germany) immediately after sampling. The FFA content was determined using a slightly modified version of the AOCS Ca 5a-40 official method [
5] working with a sample volume of 1.0 g. The volume of alkali used for titration to neutralize the acidity in the samples was used to calculate FFA expressed as oleic acid using the formula in the official method. The relative standard deviation (RSD) of the titration method was 14.57% in the range of 0.1–1% FFA and 9.84% in the range of 1.0–2.0% FFA [
5]. In total, 25 reference samples were obtained from each batch and were stored at 5 °C for 5–7 days until analysis.
2.4. 31P-NMR Analysis
The crude oils were analyzed using 31P-NMR to quantify the different phosphor-containing species present. An oil sample of approx. 250 mg was transferred to a 2 mL Eppendorf tube. An aliquot of 500 µL 2 mg/mL TPP (triphenyl phosphate) in MeOH was added as internal standard, together with 500 µL of chloroform (CDCl3) and 500 µL 0.2 M Cs-EDTA buffer. The tube was shaken for 30 s and centrifuged at 13,400 rpm for 1 min to obtain phase separation. An aliquot of 700 µL of the organic phase was transferred to a 5 mm NMR tube for analysis. All 31P-NMR spectra were acquired at 298 K on a Bruker Avance IIIHD 400 MHz spectrometer equipped with a BBFO room temperature probe. Each spectrum was acquired using 128 scans with a relaxation delay of 5 s. The concentration of each species was calculated as ppm P, i.e., mg elemental P per kg oil sample.
2.5. Data Analysis and Model Calibration
The data analysis was performed in MATLAB R2021a software (The Mathworks, Inc., Natick, MA, USA) using the PLS_Toolbox version 8.9.1 (Eigenvector Research Inc., Wenatchee, WA, USA).
2.6. Data Screening and Removal of Spectral Outliers
Visual inspection of the spectra revealed that several of the acquired spectra were noisy due to particles and bubbles passing through the experimental flow cell. These scans were considered outliers and removed programmatically if (1) the absorbance was higher than 2.99 for more than eight spectral wavelengths (indicating a blockage) and/or (2) a scans’ standard deviation of absorbance evaluated over all wavelengths was found to have a standard deviation lower than 0.15 (air bubble or no sample present).
Data was subsequently averaged in the process time domain by application of a moving arithmetic average algorithm with a varying window width. The window width varied from 0 to 300 s in 30 s interval, always operating on prior data. The moving average algorithm applied the same weight to all spectra in the window.
2.7. Preprocessing of the Spectra
In order to reduce the amount of optical scatter in the acquired scans, three preprocessing methods were examined to determine if any significant reduction in predictive error could be obtained [
19]: MSC (Multiplicative Scatter Correction) [
20], EMSC (Extended Multiplicative Scatter Correction) [
21] and Savitzky-Golay (SG) second order smoothing and second derivative, with widths of 15, 31 and 59, respectively [
22]. All data were mean centered (MC) after preprocessing, and the three preprocessing methods were compared to raw mean centered spectra. The spectral range of 1100–2200 nm was used for model calibration.
2.8. Principal Component Analysis (PCA)
Data were explored by Principal Component Analysis (PCA) to evaluate the process kinetics and changes over time and to retrieve information about systematic variation in the dataset [
23]. A total of 66 PCA models were calculated, representing each combination of window width (0 to 300 s with 30 s intervals) and preprocessing (MC, SG15,2,2 + MC, SG31,2,2 + MC, SG59,2,2 + MC, MSC + MC, EMSC + MC).
2.9. Partial Least Squares (PLS) Regression
Partial Least Squares (PLS) regression [
24] was performed to evaluate the relationship between the near-infrared spectra and the FFA reference values obtained from titration. The preprocessed free run data were reduced to align with the response variables using time stamps, and all non-aligned spectra were excluded. The two batches were combined into one dataset of size 48 spectra × 1178 variables. One PCA model for each preprocessing method was calculated, and outliers were identified programmatically based on the confidence of scores in LV1 (latent variable) and LV2, explaining on average 93% of the data across all window widths and preprocessing methods. Samples outside the 95% (± 2 standard deviations) confidence were removed, and finally the intersection of outliers across the six different preprocessing methods was removed to obtain datasets of equal size for comparison. The final dataset for PLS regression included 45 spectra of 1178 variables.
Calibration models were cross validated using a segmented venetian blinds cross validation, with six segments of four samples each, equivalent to 15–30 min reaction time. The calibration models were evaluated based on the number of components included, the RMSECV (Root Mean Square Error of Cross Validation) and the correlation (as coefficient of determination, r2). The number of latent variables to include was initially estimated by evaluation of the data and then fixed for easier comparison between combinations of preprocessing and average window width.
2.10. Variable Selection
Variable selection was carried out by both recursive PLS (rPLS) [
25] and interval PLS (iPLS) [
26]. Recursive PLS uses the regression vector as a base for weighting the individual variables, taking advantage of the fact that the regression vector reflects the importance of the variables [
25]. In this study, rPLS models including one to five latent variables were calculated for each combination of window width and preprocessing, and the included variables resulting in the lowest RMSECV were chosen for modelling. Interval PLS performs selection of variable intervals based on the RMSECV value obtained for each individual interval of variables and sequentially selects the intervals resulting in the lowest RMSECV [
26]. In this study, the wavelength region (1100–2200 nm) was divided into segments of 100 variables using forward selection.
4. Discussion
The application of NIRS to monitor and predict constituents of vegetable oil has been studied in the literature, especially as related to the production of biodiesel. Rapid in-line and on-line NIRS methods have been evaluated as being successful for monitoring the transesterification reaction in the production of biodiesel from vegetable oil [
10,
12,
14]. Bendini et al. [
31] investigated the use of NIRS to predict fat content, moisture and free acidity (FA) in industrial olive oils. A total of 322 spectra were acquired in transmission mode and preprocessed as first derivative spectra (Savitzky-Golay 17,1,1) and baseline correction. Free acidity was predicted by PLS regression with an RMSECV of 0.03% (
w/
w) and an r
2 of 0.986 in the range of 0.04–1.96% using nine latent variables in an optimistic leave-one-out cross-validation scenario. Rodrigues et al. [
7] predicted the FFA content in refined soybean oil by NIRS using a PLS calibration model that included seven latent variables with an RMSECV of 0.88% (
w/
w) and an r
2 of 0.99, using baseline correction and preprocessed as first derivative spectra (Savitzky-Golay 9,1,1). However, the range of measurements in the investigation covered 0–90% (
w/
w) FFA content, which differs significantly from the much smaller and narrower range investigated in this study (0.34–1.25% (
w/
w) FFA).
For the industrial degumming process, the FFA represent the most obvious and direct chemical components for the establishment of calibration models. In industrial production plants, FFA calibration models will also be straightforward to maintain and update, since the FFA determination by titration is normally available. The present study shows that calibration of NIR spectra to predict FFA content was indeed successful, although partially indirect in nature, since the consumption of water also play a significant role in the calibration models. It is not abnormal nor performance-deteriorating to build PLS regression models on indirect correlations, assuming that the indirect correlations rely on inherent biological or chemical relationships; this may be called the chemical cage of covariance [
32].
During these experiments, spectral acquisition was carried out as fast as the instrument allowed (i.e., every 2–8 s) and subsequently averaged using different window widths to obtain a compromise between experimental noise and process dynamics, with temporal averaging necessary to reduce the experimental noise. In the present study, each spectrum in the window was weighed equally as a simple moving average (boxcar filter). Initially, two other weighted moving average functions were evaluated to investigate the effect of assigning higher weights to the most recent spectra. However, neither linear nor gaussian weighing functions provided better results (data not shown). The results obtained by PCA indicate that a window width of 30 s is sufficient and that a larger window width does not have a significant impact on the explained variance. This indicates that a normal procedure of averaging 16, 32 or 64 scans is suitable in most process monitoring situations. However, in processes with high dynamics, such as enzymatic processes, free-run spectroscopy can aid in the understanding and interpretation of these changes and potentially allow for monitoring rapid changes in the chemical system that would otherwise be hidden.
The calibration routine used in this study was fully automated, i.e. both outlier detection, determination of the number of latent variables and variable selection were done programmatically, with a minimum of or no user input. However, the detection of outliers is based on a 95% confidence interval to exclude extreme samples from the calibration model. As no reliable automatic method exists for the automatic selection of the optimal number of components, we did not automate this process but rather based it on manual selection. An alternative method for outlier detection based on the influence plot (Hotelling’s T2 vs. Q residuals) was investigated, but this procedure did not result in improved prediction models; however, it may have the potential for more robust model generation.
The main challenge in this calibration problem is that the reaction changes fast at the beginning of the process (the first 0–5 min) as is visualized in the PCA score plot in
Figure 5 and
Figure 6. Since reference samples were only collected every fifth minute during the reaction, it is difficult to establish PLS regression models including all samples, as the first reference samples in each batch show outlying behavior due to extreme reference values. This is also evident from
Figure 9, which shows the predictions. A solution to this challenge would be to take out reference samples more frequently during the first 30 min of each batch, e.g., every second minute.
The relative standard deviation (RSD) of the FFA titration method is estimated to be in the range of 10–15%. As the RMSE summarizes both the error arising from the reference and the calibration model, the established prediction models may perform better than the reference method. To confirm this, the FFA content should be quantified using a more accurate method, e.g., GC-MS [
33] or SFC [
34].
Via optical fibers, NIR instrumentation can be installed at one or several places in the process line to enable real-time process monitoring and control. Application of PAT principles such as feed-forward and feed-back systems will enable the possibility of fully automatic process control, e.g., automatic enzyme dosage [
16]. In addition, on-line monitoring of the process gives real-time insight into the process kinetics, as demonstrated in this study. NIRS facilitates rapid and constant monitoring of the process and enables establishment of control charts. Thus, immediate action can be taken, and the production can be halted/changed if the quality is not acceptable, i.e., if the degumming is insufficient. Moreover, if it is possible to monitor the concentration of the different phospholipids, it can be made clear which part of the hydrolysis reaction is not proceeding optimally. This information can be used to optimize the conditions during degumming to obtain a more effective process.