Iterative Multivariate Peaks Fitting—A Robust Approach for The Analysis of Non ‐ baseline Resolved Chromatographic Peaks

: Selectivity in separation science is defined as the extent to which a method can determine the target analyte free of interference. It is the backbone of any method and can be enhanced at various steps, including sample preparation, separation optimization and detection. Significant im ‐ provement in selectivity can also be achieved in the data analysis step with the mathematical treat ‐ ment of the signals. In this manuscript, we present a new approach that uses mathematical functions to model chromatographic peaks. However, unlike classical peak fitting approaches where the fit ‐ ting parameters are optimized with a single profile (one ‐ way data), the parameters are optimized over multiple profiles (two ‐ way data). Thus, it allows high confidence and robustness. Furthermore, an iterative approach where the number of peaks is increased at each step until convergence is de ‐ veloped in this manuscript. It is demonstrated with simulated and real data that this algorithm is: (1) capable of mathematically separating each component with minimal user input and (2) that the peak areas can be accurately measured even with resolution as low as 0.5 if the peak’s intensities does not differ by more than a factor 10. This was conclusively demonstrated with the quantification of diterpene esters in standard mixtures.


Introduction
Analytical separation techniques are one of the most potent tools to measure target analytes in complex matrices. Multiple mechanisms can be used to separate structurally similar compounds while quantifying hundreds of components quickly. The selectivity of the analytical pipeline ("Selectivity of a method refers to the extent to which it can determine particular analyte(s) in a complex mixture without interference from other components in the mixture" [1]) can be enhanced at various steps. Generally, three critical points are considered, the sample preparation, the separation mechanism and optimization, and the detection step. Each of these steps allows removing interferents.
Nevertheless, despite a plethora of tools, ad hoc performances for all analytes remains challenging. Non-baseline separated peaks within the time domain is often observed, resulting in lower precision and accuracy [2,3]. Often overlooked, enhanced selectivity can also be obtained at no cost using mathematical and statistical algorithms [4,5], a step refers as mathematical separation. Different approaches are possible, including peak fitting, mathematical transformations of the signal, and multivariate approaches.
With peak fitting approaches, mathematical functions that approximate chromatographic or electrophoretic peaks are used to model the profiles either with single or multiple mixed peaks. In this technique, mathematical functions [6][7][8][9] that best describe the observed profiles are fitted to the experimental signal. The best match is obtained by optimizing the fitting parameters using a minimizing function, such as the sum of squared residuals (SSR) [10].
SSR= Y xf n x, a 1,n , a 2,n , a 3,n ,… n 2 t end x=t start (1) where x is a specific time, Yx is the intensity of the profile at x, fn is the mathematical function that described the peak n and a1,n, a2,n, a3,n are the fitting parameters to be optimized. The number of fitting parameters will depend on the mathematical function. While peak fitting is valuable to study secondary separation mechanisms [11], it has limited application for the deconvolution of highly mixed peaks as often more than one combination of fitting parameters or mathematical function is possible [12]. Many different mathematical functions have been proposed [9,13], and it remains an active field of research [8]. However, polynomial modified Gaussian functions (PMG) have been often used as a good compromise between low complexity and good flexibility [13,14].
With mathematical transformations of the signal, the original profile is transformed or modified to increase efficiency and resolution. One of the most common approaches is the first and second derivatives to detect co-migration and peaks limits better. In recent work, Wahaba and co-workers increased the resolution using a derivative enhancement method [15], allowing better precision with simulated and real data. The mathematically modified profile is a linear combination of the original and derivative of the signal. Another interesting approach is the power transform of the profile that can significantly improve the resolution with resolution as low as 0.8 [16,17].
With multivariate approaches, higher dimensions data (three or higher) are used. In separation science, this is the case with hyphenated detectors (diode array, mass spectrometry). Three-dimensional data can also be constructed by aligning results from multiple analyses. Multivariate analyses assume that the matrix of data, X, can be expressed as the product of two smaller matrices, the matrix of profiles P (or peaks) and the matrix of response S.
where E is the matrix of error. In the case of hyphenated data, the target spectra in S may be known; in this case, P can be readily obtained and then analyzed using classical chromatographic approaches [18].
where XS −1 is the least-squares solution to the system of equations X•x = S. However, in many cases, both matrixes are unknown. Multivariate curve resolution (MCR) aims to obtain chemically plausible estimates of both matrices when knowing the number of components and using specific constraints [4,[19][20][21]. While MCR has been successfully applied to chromatographic data [22], it is a complex approach requiring users' input and control.
This manuscript proposes a new multivariate approach where the matrix P is estimated using chromatographic mathematical functions and optimized to three-dimensional data. The complexity of the matrix P (the number of peaks) is iteratively increased until convergence is reached. The approach is validated using simulated and real data and compared with MCR. Two mathematical functions to described peaks were tested, the classical Gaussian function and a polynomial-modified function with one (PMG1) or two (PMG2) additional fitting parameters [23].

Theory
MATLAB functions can be used with first and second-order chromatographic data [24]. The second-order could be due to a hyphenated detector or the alignment of multiple samples to a common axis. In this manuscript, a channel refers to a first-order chromatographic profile. Three different mathematical functions to describe chromatographic peaks can be used: Gaussian, PMG1 and PMG2 [25], with the PMG function defined as [14] ℎ (4) where H0 is the height at the peak maximum, tr is the time at the peak maximum, s0 is the standard deviation of the symmetric component and s1 and s2 described the peak distortion [23]. With PMG1, s2 is set to zero, while with PMG2 both s1 and s2 can vary.
The iterative multivariate peaks fitting algorithm is described below: 0. Initialization step. The average of the signal Yav(t) as a function of time is calculated by averaging all the channels. Next, the time at the maximum response for Yav and the variance assuming a single peak is measured. In the first column, the matrix P is populated with a constant (constant baseline drift), and in the second column, the initial peak profile. Finally, the starting matrix P is estimated using the initial measured position and variance with the selected mathematical function (by default PMG1). Additional fitting parameters, if any, are set to their minimum values. All columns in P are normalized to one (H0 = 1 in Equation (4)). 1. Optimization step. The minimization function is obtained by estimating X, Xest using: The minimization function is the mean sum squared residual (MSSR) between X and Xest in each channel. Fitting parameters are optimized using either the Simplex method [26] (fminsearch function in MATLAB [27]) or the Quasi-Newton method [28] (fminnunc function in MATLAB [29]). The goodness of the fit is assessed via the MSSR.
2. Iteration step. The average residua between X and Xest are measured. Then, a new column is added to P corresponding to a new peak. The position of the peaks depends on the residual, while its shape is the average of all other peaks. 3. Optimization and termination conditions. Before optimization, and to avoid local minima, the variance of all peaks is decreased by a set factor. The optimization (step 1) is repeated, and the termination conditions are tested. If validated, the routine is stopped; otherwise, the algorithm loops to step 2. Different termination conditions can be used. The algorithms will be terminated if (1) the MSSR obtained in step 3 does not decrease by more than a set value (by default 5%) in comparison to the previous value obtained with one peak less, the maximum intensity of the new peak is less than x% of the most intense peak (by default 5%), or (2) the minimum resolution is less than a set value (by default 0). Because local minima can be reached, variations in the initial conditions are possible at step 2. Optimizations will be run simultaneously, and the result with the lowest MSSR will be used.
A short tutorial is available in Appendix.

Exploration of Data
The datasets consist of four matrices with four components with known profiles and concentrations. The data can be visualized in Figure 1, with Figure 1A,C,E, and G being the superposition of the signals from all channels for D1nh, D2nh, D3nh and D4ng, respectively, and with Figure 1B

First-order vs. Second-order Iterative Peak Fitting
D1nh was used to compare first and second-order iterative peak fitting. For first order, the intensities from all channels were averaged. The Simplex method (fminsearch) was used to optimize the fitting parameters. The iterative process was stopped when adding an additional peak does not decrease the MSSR by more than 5% compared to the previous iteration. Results are presented in Figure 2.
With first-order data, convergence was obtained with two peaks with a Gaussian model (Figure 2A.I), one peak with a PMG1 model ( Figure 2B.I) or two peaks with the PMG2 model (C.I). While proper fitting of the experimental profile can be observed in all cases, the four expected components were not mathematically separated. It was not the case when using the second-order data. Independently of the function used, similar results were obtained, with the four components separated. Slightly better results were obtained with PMG2 ( Figure 2C.II, MSSR of 648), followed by PMG1 ( Figure 2B.II, MSSR of 654) and Gaussian models (Figure 2A.II, MSSR of 656). However, the computation time varied from 9.5, 37.8 and 85.7 s for Gaussian, PMG and PMG2 models, respectively. It should be emphasized that first-order peak fitting is highly dependent on the channels. When selecting specific channels rather than averaging the signals, different results (including number of peaks) were obtained in Figure 2, A1-C1. Second-order peak fitting, provide robust deconvolution as the optimization involve all channels.  The iterative multivariate peak fitting (itMPF) was then used, with the PMG1 function, with the other three datasets. The results can be seen in Figure 4. While the four components in D2nh ( Figure 4A) and D3nh ( Figure 4B) were not mathematically separated, itMPF was successful with D4nh ( Figure 4C). Moreover, better agreements were obtained than with D1nh between measured and true intensities (r 2 of 0.989, 0.973, 0.970 and 0.984, respectively).

Validation with HPLC-DAD Separation of Diterpene in Coffee
The peak fitting approach was applied to the simultaneous analysis of diterpene esters in coffee by HPLC-DAD. Typical separations are presented in Figure 5A, with, in black (a) the separation of a standard mixture of (1) kahweol oleate, (2) cafestol oleate, (3) kahweol palmitate and (4) cafestol palmitate. The profile is obtained by averaging the intensity of all wavelengths. The profiles in red (b) and green (c) correspond to the separation of a coffee sample obtained either by averaging the intensities at all wavelengths (b: 200-400 nm) or using a range selective to kahweol esters (c: 286-294 nm).
ItMPF was first used with standard mixtures of the four esters. The superposition of the profile at each wavelength is presented in Figure 5B. As a significant fluctuation of noise can be observed [30], data were normalized by the noise, estimated as the standard deviation, measured between 18 and 18.5 min of the lowest concentrated standard ( Figure  5C). The four peaks between 18 and 22 min in the standards were separated using itMPF with the PMG1 function. The ability of the software to measure the area correctly was assessed by the figures of merits (FOM) of the linear calibration. Those can be compared in Table 1 using fminsearch (simplex method) and fminnunc (quasi-Newton method) optimization algorithm. As a reference, the kahweol oleate and palmitate were quantified free of cafestol esters interferences by using the wavelength range 290 ± 4 nm [18,31] (see Figure 5A (c). Peak areas were measured using the trapezoid rule [32]. While the best results for KO and KP were obtained using classical approaches, itMPF gave good results for quantifying the four diterpene esters is with an average resolution between peaks of 0.83 (min 0.77, max 0.92). With those data, slightly better results were obtained using the fminnunc optimization algorithm.
It should be emphasized that at low concentrations, more than four components are obtained. Additional components are used to better fit the baseline drift. Such an example can be seen in Figure 6. In panel A, the real data is presented with the superposition of the profile and each wavelength. Panel B shows the superposition of the fitted channels, and panel C shows the peak profiles extracted by itMPF to model the experimental data. While peaks 1 to 4 correspond to the four expected diterpene esters, an extra peak is obtained to describe the baseline better. This peak can easily be recognized due to its substantial peak variance (760 min 2 vs. 0.023 ± 0.001 (n = 4) min 2 for peaks 1 to 4).

Discussion
Iterative multivariate peak fitting is an attractive alternative to other multivariate approaches for chromatographic data. It is relevant because the number of components does not need to be known and allows quantifying individual peaks event with resolution as low as 0.5. However, the minimal resolution for good accuracy will depend on the relative intensity of the different signals and the variation of intensity along the extra dimension. While fminnunc seems to offer slightly better performance than fminsearch as optimization algorithms, computing time is a key issue. itMPF is fast when few peaks are fitted to the signals (less than 1 min for three components or less) but become slow when a higher number of peaks is used. For example, the algorithm took more than 40 min to optimize the fitting of nine peaks with a PMG1 function using the data from coffee samples. The complexity and number of fitting parameters also have a strong influence. With a Gaussian function, minimal MSSR was reached after 8 min with nine peaks. Nevertheless, this approach could be used to entire chromatograms by splitting the signals into smaller parts with few peaks.
One of the key benefits of this approach is that the number of components is automatically optimized from the data. Moreover, the baseline can be automatically detected and fitted with an extra function that must be detected and removed. The toolbox should be improved by using a more extensive set of functions.
The algorithm has been tested in multiple conditions and, most of the time, give realistic results. However, in some examples, negative peaks have been obtained. To limit their occurrence, a penalization factor for negative intensities was introduced. This factor is only used when calculating the MSSR. Constrained have also been introduced to force all peaks to have the same shape or limit the variation in peak variance within a set interval. A short tutorial is available in Appendix.

Simulated Data
The simulated HPLC-DAD non-trilinear data sets were obtained from [33]. The four matrices were used (D1nh, D2nh, D3nh and D4nh). Each matrix has been generated as Di = ci.s + noise where ci (51x4) is the matrix of profiles (four peaks, with different positions and shapes in each dataset), s is the matrix of spectra (4 × 96) corresponding to each peak (s is the same in all datasets) and noise is the noise.

Real Data
Real data come from a published work measuring kahweol oleate, cafestol oleate, kahweol palmitate and cafestol palmitate in coffee brew [18,31]. Briefly, coffee brews were extracted using 5 mL of diethyl ether. The mixture was vortexed for 2 min, and after centrifugation, the upper phase was transferred to a clean test tube. Next, the aqueous solution was re-extracted using diethyl ether, and the combined ether phase was washed with 5mL of 2 M NaCl solution followed by centrifugation (4000 rpm, 10 min. Finally, the clean ether phase was dried under an azote stream. Samples were kept at −22 °C until analysis using LC-DAD. Separation was achieved using a Purospher STAR LichroCART RP 18 end-capped (250 × 4 mm, 5 μm) column attached. Before injection, the dried extract was dissolved in 2.5 mL of acetonitrile and filtered through 0.45 μm filter membrane (PTFE, VWR, Radnor, PennsylvaniaUSA). Twenty microliters of sample were injected, and the separation was achieved using isocratic conditions for 35 min with the mobile phase made of acetonitrile/isopropanol (70:30, v/v) and pumped at 0.4 mL/min. A diode array detector, in the range of 200-400 nm, was used. After each run, the acquisition software exported data as a comma-separated-values format (EZChrom Elite 3.1.6).
For the calibration curves, mixtures of the four diterpenes esters at similar concentrations were used in the range of 2 to 200 mg/L. Eight samples were prepared and run in duplicates. All data used in this work are available in the Zenodo repository (https://zenodo.org/record/5412345 )

Programming and Software
MATLAB R2020a (Mathworks, Natick, Massachusetts, USA) was used for this work; functions were programmed and run using a PC equipped with an Intel Core i7 CPU (2.80 GHz) and 18.0 GB RAM. The functions fminsearch [27] and fminunc [29] from the optimization toolbox were used. The functions IterativeMethod1_fminsearch and Iterative-Method1_fminunc were designed for this work and are available free of charge in the GitHub repository: https://github.com/glerny/itMPF [34].   Options.PointsPerPeaks = [25 75]; the average number of points per peak, used to smooth the residual when adding a new peak and avoid spikes. More than one value can be used to induce variations in the position of the new peaks and avoid local minima.  Options.MinMax = 0.05; Termination condition. If the maximum intensity of any peak is lower than Options.MinMax time the intensity of the most intense peak the function stops.  Options.Robust = false; If this option is used, an additional peak will still be tested when a termination condition is true. If with additional peaks the termination is not true anymore, the algorithms will continue. In output, four elements are obtained:  Model.Peaks is a structure that contain the name of the mathematical function, the fitting parameters, the intensity at each channel and the baseline intensity  FittedChannels is a [mxkxn] matrix with the intensity as a function of time for each element  Stats is a structure with the Options used, the computing time, the number of peaks separated, the end conditions and the means sum square residual.  myModel is a [mxk] matrix with the normalized peaks model.