The Cascade Hilbert-Zero Decomposition: A Novel Method for Peaks Resolution and Its Application to Raman Spectra

: Raman spectra of biological objects are sufﬁciently complex since they are comprised of wide diffusive spectral peaks over a noisy background. This makes the resolution of individual closely positioned components a complicated task. Here we propose a method for constructing an approximation of such systems by a series, respectively, to shifts of the Gaussian functions with different adjustable dispersions. It is based on the coordination of the Gaussian peaks’ location with the zeros of the signal’s Hilbert transform. The resolution of overlapping peaks is achieved by applying this procedure in a hierarchical cascade way, subsequently excluding peaks of each level of decomposition. Both the mathematical rationale for the localization of intervals, where the zero crossing of the Hilbert-transformed uni- and multimodal mixtures of Gaussians occurs, and the step-by-step outline of the numerical algorithm are provided and discussed. As a practical case study, we analyze results of the processing of a complicated Raman spectrum obtained from a strain of Mycobacterium tuberculosis . However, the proposed method can be applied to signals of different origins formed by overlapped localized pulses too.


Introduction
Raman spectroscopy is an effective method for studying the molecular structure of a substance; it can be used in a wide range of scientific applications for organic [1,2] and inorganic [3,4] matter investigations. Biological molecules have a complex nature and structure [5][6][7]. Due to this fact, the spectrum of Raman light scattering is complex and consists of many components. Moreover, their typical property is a significant widening of peaks, which originates not only from the instrumental response function, but also from complex biochemical structure and interactions [8][9][10]. Such complex spectra often belong to heterogeneous systems, such as cells, tissues, bacteria. This makes the task of analyzing such spectra similar to the task of regressing continuous (albeit noisy) functions formed by a superposition of wide bell-shaped curves (their width under different biophysical conditions also carries meaningful information) rather than revealing positions of thin well-localized peaks in Raman spectra of pure chemical components for which effective methods of the discrete spectrum reconstructions are developed; see, for example, [11][12][13]. For the successful identification, classification and quantification of desired features,it is not enough to obtain a high-resolution raw spectrum, but to perform complex pre-processing procedures using commercial or laboratory-made software and algorithms. Minor or no spectral variation can be observed in raw spectra describing different structures in biological objects. In that case, to obtain meaningful spectral information, we need to process and analyze the data in detail. Univariate and multivariate methods can be applied for it. Univariate methods can include first-and second-order derivatives [14], curve fitting [15], component-wise spectral analysis [16], various bands intensity, peak broadening [17], change in intensities, etc. The multivariate methods are efficient for huge datasets [18] and investigations of subtle changes in spectral data. These methods include principal component analysis (PCA) [19], linear discriminant analysis (LDA) [20], multiple linear regression (MLR) [21], cluster analysis (CA) [22], partial least squares regression (PLS) [23] and can be performed for Raman spectroscopy spectral analysis [24]. On the other hand, applying standard univariate methods for the analysis of subtle spectral differences can be a problem for multicomponent spectra investigations [25]; the multivariate methods need a large amount of data [26], adversely affecting the speed of analysis in the case of practical applications.
As spectral pre-processing methods are successfully used for solving combined applied biophysical and medical problems [27,28], there is a demand for the development of fast and accurate new mathematical methods for Raman spectral analysis, due to the problem of resolution of overlapping peaks. That is especially important in the case of spectral analysis, which draws active attention recently from the point of view of integral transforms, e.g., the wavelet transform [29,30]. This kind of multiscale approach uses possibilities of an extended 2D resolution over multiple scales of a signal's smoothing that improves the accuracy of peak detection from the zero crossings and ridges of the smoothed signal's derivatives. However, in addition to the complicated procedure of 2D image processing, its applicability is limited in the cases of completely merged peaks, i.e., when several peaks form asymmetric monomodal ones as well as for highly noisy data typical for Raman spectra of biological objects.
Additionally, there are some methods that were applied for peak resolution in signals of other origin. The first attempt at using the Hilbert transform for the localization of peaks in signals was proposed in the work [31] within the context of determining the maxima of cross-correlations of signals. However, its author noted the difficulties strongly affecting the results in the case of multiple overlapping peaks and wide peaks of asymmetric shape. On the other hand, such a methodology was successfully applied to detecting R-peaks in electrocardiogram signals, where these peaks are significantly narrow and separated by sufficiently long time intervals [32]. Further, this approach gained popularity and various improvements were proposed, e.g., [33,34].
The idea to expand a function into a series with respect to shifts of the Gaussian function (note that fitting a set of Gaussian functions to a complex spectrum of chemical substance was discussed [35,36] within the context of optimization algorithms, allowing the resolution of close peaks and the analysis of the corresponding residuals) was mathematically justified in [37], where so-called "approximate approximation" by means of various shift-invariant systems was developed. However, our construction is not based on a shift-invariant system. It is adaptive to the form of the function and allows for different dispersion of the Gaussian functions.
Thus, the work is structured as follows: we introduce a mathematical justification for adapting Hilbert's approach to recognize overlapping peaks, then demonstrate it on spectra obtained experimentally on bacteria Mycobacterium tuberculosis causing tuberculosis.

Basic Properties: A Single Gaussian Function
Let us consider the following Gaussian function: normalized for the illustrative clarity in such a way that is has the unit amplitude maximum located at the point x 0 and the dispersion σ. Its Hilbert transform is defined as the following: and results in the Dawson function as follows: which has zero value at the point x = x 0 . See an example in Figure 1 for Equation (1) with x 0 = 5 and σ = 1.5. Thus, the zero crossing of the function H[g](x) can be used to localize the Gaussian's maximum. Taking into account that the Gaussian function is fast decaying, i.e., 60.6% − 13.5% − 1.1 % of the magnitude maximum for |x − x 0 | equal to σ − 2σ − 3σ, Gaussians separated by a distance exceeding several dispersion values can be considered practically independent, and distances between their peaks can be found by searching for the distances between zeros of the Hilbert transform of such a sum of Gaussian functions as was proposed in Ref. [32]. Figure 2A illustrates such a case for the combined function as follows: with the amplitudes A 1 = 1, A 2 = 0.5, the dispersions σ 1 = 1, σ 2 = 0.5, and the peak positions x 01 = 5, x 02 = 15. However, when peak positions are close together, i.e., Gaussian component functions are overlapping, the situation is much more non-trivial and requires a special consideration.

Zero Crossing for the Hilbert Transform of Overlapping Gaussian Components
As it is seen in Figure 2B,C the closeness of the Gaussian components leads to vanishing separated uni-modal components accompanied with the reduction in the Hilbert transform's zero crossing to one point, which is clearly separated from the positions of any peak of the individual Gaussian components. More precisely, let us consider the sum of two Gaussians f (4), where for simplicity we put Without loss of generality, we assume A, x 0 , σ > 0. We take the Hilbert transform of f and claim that for sufficiently small shift x 0 , the Hilbert transform of f has a unique root.
x 02 = 6.5, and their Hilbert transforms (dashed lines). Vertical dash-dotted lines indicate peak positions for individual components.
where q is the maximum point of the Dawson function, q = ArgmaxD ≈ 0.92, then H[ f ](x) has a unique root.
Proof. Taking the Hilbert transform of f , we obtain the sum of the Dawson functions as follows: It follows from the properties of the Dawson function that For an arbitrary function with some strongly marked clusters of peaks and without any symmetry with respect to the vertical axis, the situation is much more complicated. However, it is possible to justify the following statement ( Suppose there exists a positive number u > 0 such that f decreases in (−u, u), and u satisfies the following inequality: where c := min f (x), and m a := min By the definition of the Hilbert transform, we obtain the following: By the Lagrange theorem, we obtain the following: Since f decreases in (−u, u) it follows that c ≥ 0. We estimate the remaining integrals as follows: Collecting all the estimates together, we conclude that the inequality (5) completes the proof.

Remark 1.
Analogously, it can be checked that if f increases in (−u, u), and u satisfies the inequality

Resolution of Peaks at One Level of Resolution
The method which we propose for the decomposition of a multi-modal function f (x) consists of the following steps.
(I) Compute the Hilbert transform of the function f (x), i.e., H[ f ](x). This can be easily carried out for a discrete data sample either straightforwardly applying the fast Fourier transform or by another appropriate numerical method [39] depending on the signal's complexity.
(II) Find all zeros of the function H[ f ](x) found at the previous step such that H[ f ](x) crosses zero from negative to positive values.
Numerically, the simplest way to do this is to check the zero crossing when one of the subsequent data points is negative and another is positive. The latter condition is based on Theorem 2 and Remark 1: if the Hilbert transform intersects the axis, i.e., H[ f ](x) = 0, then in a sufficiently large neighborhood of such a point on the left-hand side, the original function cannot increase, and on the right-hand side, it cannot decrease.
(III) Around each appropriate j-th zero crossing point, find an interval within which the Hilbert transform with 98 % correlation corresponds to a linear fit on this interval, and fit the processed function on this interval by the following Gaussian: by a non-linear regression that gives its amplitude B j , variance σ 2 j , and the corrected point of maximum x 0j .
The necessity of such an approach is again based on the important fact noted above: the zero crossing point does not coincide with the true peak position if the signal is asymmetric, e.g., as a consequence of merging several close positioned wide elementary components. Thus, the optimization procedure shifts x 0j to the actual position. The choice of the interval for this optimization procedure is based on the properties of the Hilbert transform of the Gaussian (1) given by the Dawson function (3) as follows: One can see in Figure 1 that this interval covers the peak's vicinity practically up to the boundaries of the peak's dispersion, i.e., the optimization procedure for searching the parameters of the nearest Gaussian function closest to the found point of the Hilbert transform's zero crossing acts within the principal part [x 0 − σ, x 0 + σ] of the desired Gaussian function.
(IV) Sum up all the Gaussian components found as f 1 (x) = ∑ G j (x) and compare the obtained approximation with the processed function. Figure 3 illustrates the results of applying the steps described above to three test signals shown in Figure 2. The case of Figure 3A consists of two sufficiently separated Gaussians that result in two zero crossings of its Hilbert transform; see Figure 2A. Both these components are detected and the processed signal is recovered perfectly. The parameters (B j , x 0j , σ j ) found for the first and the second component are (1.02, 5.02, 1.02) and (0.49, 7.00, 0.50), respectively.
For two more complicated cases, Figure 3B,C, only one zero crossing, and, moreover, at the point that does not coincide with any of the maxima of individual components, occurs; see Figure 2B,C. As a result, only one Gaussian component can be recovered for these cases. At the same time, it is is clearly seen that the procedure of the non-linear fitting-based optimization of the recovered peak's position accurately shifts the revealed single Gaussian component to its true position. When both Gaussian components overlap but two peaks are still visible, Figure 2B, the reconstruction of the left one quite accurately follows the processed signal, where an influence of the right Gaussian is negligible. When both peaks are so close that the resulting signal is a monomodal asymmetric peak, Figure 2C, the left component differs from the shape of the overall curve.
Thus, the next stage required is to supply this primary analysis with additional steps aimed at revealing the rest of the components.

The Cascade Revealing Hidden Peaks
As seen in Figure 3, the part where a component (or several components) associated with zero crossings of the original signal's Hilbert transform dominates is reproduced with high accuracy. Therefore, the task of reconstructing the rest of the components can be realized as a cascade of the following steps.
(V) Find the difference between the original signals and the signal reconstructed during the steps (I)-(IV), ∆y(x) = f (x) − f 1 (x), and consider this the new function to which the steps (I)-(IV) should be applied again. Figure 4 illustrates this step. One can see that the difference between the initial function and the actually found Gaussian has now the form of a bell-shaped curve. Respectively, the Hilbert transform taken from this residual again exhibits zero crossing. For the case of more separated Gaussians forming the initial signal, Figure 4A, this residual looks like an accurate Gaussian, has a unique zero crossing of its Hilbert transform, and the procedure of the component's fitting is simple and straightforward as above. Figure 4B illustrates a more complicated situation of more closely positioned components in the initial signal. Here, one can see a more complicated shape of the Hilbert transform, which has two zero crossings. However, as it is argued in the description of step (II), the first one should not be considered since it is the zero crossing by the decaying part of the Hilbert transform. Note also that the plot of the residual around this point demonstrates ∆y < 0, i.e., one cannot consider this "hump" as a hidden peak because we are interested in the positive peaks only for our particular class of problems. The second zero crossing goes in the required direction, and the non-linear fitting procedure restores the desired second Gaussian component of the signal.
Finally, the full combination of revealed peaks can be constructed. However, since the second peaks' magnitude can significantly differ from zero at the point of location of primary peaks, we need the step of internalization of amplitudes for the peaks found during previous steps, similar to the correction step of the lifting scheme for constructing wavelets [40]:

Materials, Methods, and Data
As an example, we consider the experimental profile for an XDR strain of M. tuberculosis from the work [6], where details of data origin and experimental procedure are described. Figure 6 shows the analyzed real Raman spectrum as a function of the wavenumber κ and indicates Gaussian components as follows:

Resolution of Peaks
found by the proposed method within three iterations. The signal was normalized to unity instead of 100 used in the work [41] for the sake of convenience. Here, we used the numerical factor 6 for comparison of the interval of local fitting and the obtained dispersion of the Gaussian functions (see the description of step (III) above) for rejecting suspiciously too wide peaks. In addition, peaks with amplitudes less than 0.05 were also rejected, comparable with the noisy background and deviations originated from imperfect reproduction of the constituent peaks' tails.
One can see that the envelope (dashed black curve) representing Equation (6) accurately reproduces the shape of the whole sequence of Raman peaks with amplitudes larger than 0.05, including ones which have a complicated asymmetric shape with "wide shoulders". The picture of individual components demonstrating such peaks can be considered to comprise several close-positioned Gaussians. The full set of parameters for the individual components revealed is listed in Table 1.
These components are found during three iterations illustrated in Figure 7. Note that their amplitudes shown in this figure correspond to the non-linear fitting at the current iteration (step (III) in the description above), while in Figure 6 and in Table 1 they are rescaled to keep the magnitude of the whole signal (step (VI)).

Discussion
We have developed a new approach that allows more accurate spectral analysis in comparison with those available earlier. We confirmed this by applying the approach to already published Raman spectra aimed at the investigation of the cell wall of Mycobacterium tuberculosis [6]. The attempt described in the cited paper to apply the commercial KnowItAll Vibrational Spectroscopy Edition (BioRad, Hercules, CA, USA) software, which is de facto standard in biophysical Raman-based studies, to the experimentally obtained curves resulted in revealing 15 identified peaks. The results of our new approach lists them also in the set given in Table 1 with the same accuracy of resolution in wavelengths that validates the new method. However, also Table 1 provides information on four additional peaks, which correspond to actually existing biochemical structures; see below.
We have revealed new spectral vibrations in the composition of the complex spectrum of mycobacterium: 1029 cm −1 for L-valine; 1042 cm −1 for L-glucamate; 1088 cm −1 for D-(C)-fucose; 1119 cm −1 for D-glucose; 1302 cm −1 for triolein/trilinolenin. This information can be used for detailed analysis of the cell wall and life cycle of mycobacteria. Thus, the amino acid L-valine and its spectral characteristics can be used for the analysis of enzymes associated with it [52], the analysis of the mechanisms of their synthesis [53]. D-glucose is also an element of the cell cycle and it can be used as a biomarker of its description [54]. Glutamate is one of amino acids that act as cellular nitrogen donors for the synthesis of biomolecules inside the cell. In mycobacteria, the assimilation of synthetic nitrogen and its conversion to glutamate is carried out by glutamate synthetase. The cell wall of pathogenic mycobacteria is known to possess a poly-L-glutamine layer. Poly-L-glutamine layer synthesis has been directly linked to the glutamine synthetase enzyme [55]. Due to this fact, glutamate spectral characteristics can be used as a biomarker in cell wall changes. D-fucose can be found in both Gram-negative and Gram-positive bacteria, in which it is a constituent of the cell wall and capsule structures [56]; this fact can be used for strain discrimination. The triolein and lipolytic protein have roles in the pathogenesis and survival of M. tuberculosis [57].
It should be noted that the fast and reliable identification and discrimination of Mycobacterium tuberculosis (MbT) bacteria species and its drug resistance is a serious global problem today. The successful spectral determination of a specific MbT strain and its intra-strain differences is vital for the appointment of anti-tuberculosis therapy [58].
Note that the developed approach may lead to inadequately wide Gaussians as outputs in the case when a processed signal contains strong noise in the vicinity of a narrow peak. This problem can be avoided by the control of output peaks' width by their comparison with the interval of the fitting. As it is seen in Figure 1, the fitting interval [x 0 − σ, x 0 + σ], where the Hilbert transform of the function is almost linear, is comparable in width with a true Gaussian's dispersion. This assures the robustness of fitting as defined by the classical criteria of linear regression. On the contrary, if a revealed Gaussian has a dispersion much wider than the fitting interval, then it should be rejected. As well, two peaks cannot be resolved if the distance between their maxima is either less than the minimum of two σ or comparable with the recorded instrumental resolution.
Nevertheless, the developed approach can be convenient in revealing hidden peaks in the spectra of biological objects, which is extremely useful in revealing subtle differences between cells or cell populations. Moreover, an application or program created on the basis of the approach can be made available to a wide range of users working in experimental spectroscopy.

Conclusions
While the vast majority of peak localization algorithms solve the detection problem for relatively well-expressed maxima in signals, we have developed a new approach using the Hilbert transform, which allows us to determine with high accuracy the hidden peaks too. This problem is especially important for spectra obtained from biological samples. We verified this method using Raman spectra obtained on bacterial cells, which made it possible to identify additional chemical substances in the composition of the cell wall. It could be helpful for further interpretation of the spectra in detail to find differences between the strains more precisely.
Finally, the knowledge of not only the separated peaks' localization, but also their amplitudes and characteristic width can be useful for further data interpretation by modern methods, which use machine learning, requiring a larger set of data characteristics.