Adaptive Savitzky–Golay Filters for Analysis of Copy Number Variation Peaks from Whole-Exome Sequencing Data

: Copy number variation (CNV) is a form of structural variation in the human genome that provides medical insight into complex human diseases; while whole-genome sequencing is becoming more affordable, whole-exome sequencing (WES) remains an important tool in clinical diagnostics. Because of its discontinuous nature and unique characteristics of sparse target-enrichment-based WES data, the analysis and detection of CNV peaks remain difﬁcult tasks. The Savitzky–Golay (SG) smoothing is well known as a fast and efﬁcient smoothing method. However, no study has documented the use of this technique for CNV peak detection. It is well known that the effectiveness of the classical SG ﬁlter depends on the proper selection of the window length and polynomial degree, which should correspond with the scale of the peak because, in the case of peaks with a high rate of change, the effectiveness of the ﬁlter could be restricted. Based on the Savitzky–Golay algorithm, this paper introduces a novel adaptive method to smooth irregular peak distributions. The proposed method ensures high-precision noise reduction by dynamically modifying the results of the prior smoothing to automatically adjust parameters. Our method offers an additional feature extraction technique based on density and Euclidean distance. In comparison to classical Savitzky–Golay ﬁltering and other peer ﬁltering methods, the performance evaluation demonstrates that adaptive Savitzky–Golay ﬁltering performs better. According to experimental results, our method effectively detects CNV peaks across all genomic segments for both short and long tags, with minimal peak height ﬁdelity values (i.e., low estimation bias). As a result, we clearly demonstrate how well the adaptive Savitzky–Golay ﬁltering method works and how its use in the detection of CNV peaks can complement the existing techniques used in CNV peak analysis.


Introduction
Copy number variation (CNV), which includes DNA segments longer than one kilobase pair being amplified or lost, is a significant class of DNA structural variants [1].The mutation rate of CNV loci is significantly higher than that of SNPs across the entire genome, making CNV an important pathogenic factor causing human complex diseases [2].The three main technology types that can generate data sets for the detection of CNVs are array comparative genomic hybridization (aCGH), SNP array, and next-generation sequencing (NGS).NGS can perform target capture sequencing such as whole-exome sequencing (WES).This is one of the techniques widely used to extract genomic information from a specific exome region of interest using customized probes in clinical diagnosis [3].This technique is cost-effective and provides much higher coverage than whole-genome sequencing (WGS) for the identification of rare variants and can provide exceptional prospects for researchers and patients [4].However, some of the limitations of this technique include the following: first the information is not continuous (i.e., not every position is covered such as intergenic regions, large introns, promoters they are ≈99% of the genome, WES only covers ≈1%); secondly, targeted genomic coordinates (exonic and coding sequences) of different kits equivalent the precise positions of the designed hybridization oligos; third, multiple combinations of oligos exist that could capture the same target region by different kit design based on the position and shape of the coverage profile of sequence data resulting from these kits and only loosely corresponds with the genome coordinates of targeted regions [4,5].Lastly, the exact position of the hybridization oligos is unknown, hence they could only be inferred from the experimental NGS sequence data [6].Thus, due to the unique features of NGS, the CNV detection methods involving targeted NGS data could be divided into four categories, notably, split-read-, de novo assembly-, pair-end-mapping-and read depth (RD)-based approaches [7,8].The read depth-based approach is more robust in detecting CNVs of any sizes other among three categories.This method's main tenet is to detect CNVs based on the variation of read depths a cross the genome to be investigated [9].Currently, there are several techniques for detecting CNVs using RD values; however, these techniques frequently have unique characteristics and limitations [10,11].The analysis of CNVs peaks with small amplitudes is still challenging due to factors such limited coverage depth and GC content bias, despite the techniques' considerably strong performance [12].Therefore, an appropriate filtering method is necessary to filter the CNVs peaks.Several filtering methods have been proposing for peak processing, including spline smoothing [13], Stein's unbiased risk estimate (SURE) [14], Fourier transform [15], Gaussian Kernel filtering [16], Epanechnikov Kernel [17], Lowess [18], Savitzky-Golay [19] and discrete wavelet transform [20].Some of those filtering methods experience computational challenges when handling highly corrupted signals.Savitzky-Golay has been widely used in biomedical electrocardiogram (ECG) signal processing due to its ability to achieve high signal-to-noise ratio and retains the original shape of the signal [21].Though this method has been widely used for signal processing, no study has reported its application in CNV peak detection and analysis.Currently, some of popular methods for detecting and analyzing CNVs include, but is not limited to, CNVkit [22], Control-FREEC [23], iCopyDAV [24], PEcnv [25] and CNV_IFTV [26].Each of these methods has its own characteristics and advantages.For example, CNVkit uses both the targeted reads and the nonspecifically captured off-target reads to infer copy number evenly across the genome to identify copy number changes based on we evaluation of three sources of bias in the sequencing read depth: GC content, target footprint size and spacing, and repetitive sequences [27].Control-FREEC most effectively utlizes GC-content to normalize the read count profile so as to find out CNV regions, and iCopyDAV chooses an appropriate bin size and uses thresholds for RD values to declare CNVs.Although much effectiveness has been achieved by these methods, limited coverage depth and GC-content bias still pose a big challenge to the detection of CNVs with small amplitudes.Therefore, it would be necessary and meaningful to seek for new methods that can grasp the essential characteristics of sequencing data associated with CNVs.In this study, we proposed an adaptive Savitzky-Golay filtering method for the detection and analysis of CNV peaks obtained from WES data.The motivation for the underlying axiom of our novel approach is as follows: it uses the existing concepts of local polynomial regression (LPR) and least squares criterion (LSC) to model the peak distribution function.It provides a generic framework for an adaptive Savitzky-Golay that automatically chooses the polynomial order and window length based on the peak distribution, allowing for accurate smoothing of peaks with high rates of change as well.It consider peak positions in each segment by calculating local density and minimum distance in order to extract two related features from the CNV peaks profile.Finally, using a multivariate Gaussian distribution, It calculate the associated p-value for the CNV peak from the feature values.We conducted numerous simulation studies to evaluate and compare our approach to peer methods.The experimental findings show the effectiveness of the method.
The rest of this article is organized as follows: In Section 2, we provide a model equation of peak distribution function, then we present a mathematical formulation of the classical Savitzky-Golay filter and adaptive Savitzky-Golay filter with respect to CNVs peak function; next, we describe a new formulation of feature extraction for CNV peak detection and analysis.In Section 3, we present the results of model performance evaluation and its application to real WES data generated for germline mutation analysis.In Section 4, we summarize the discussion of the proposed method's and genetic implication with respect to CNV peak detection and analysis.In Section 5, we conclude and outline our plans for future work.

Peak Distribution Function
Generally, peak distribution tends to be asymmetrical in nature; therefore, it is important to develop a function that is applied to a wide class of the peak distribution.Let us consider the existing concept of local polynomial regression (LPR) [28] and the least squares criterion (LSC) [29]; thus, the peak distribution function is given by equation where f (S i ) is the main peak, f 0 (S i ) is the noisy peak and w(S i ) is identically distributed (iid) additive white Gaussian noise of mean zero and variance σ 2 , S is the segment.To keep things simple, we will represent a peak f 0 at i th segment by f i f (S i ) and w at i th segment by w i w(S i ).

Classical Savitizky-Golay Filtering
In this subsection, we provide a summary of the mathematical formulation of classical Savitizky-Golay filtering, based on the work of [30].We first perform the polynomial fit to obtain the filtered output value by computing the polynomial coefficients at the central index of the approximation window.We then consider a symmetrical window length M = 2m + 1, i = −m, . . ., λ, . . ., m with data point x at a reconstruction point λ represents the index of the middle point at 0. Thus, the k th order of the polynomial P is calculated by The aim is to fit a polynomial of order P = N ∑ k=0 f k (S i ) k in a least square manner by minimizing the cost function using equation To obtain data point at the central index 0 with zeroth polynomial coefficient as y 0 = P 0 = f 0 , we calculate an optimal polynomial coefficient by differentiating ε N in Equation ( 2) with respect to N + 1 unknown coefficients and setting the derivatives to zero to obtain the following sets of equations so, by interchanging the order of the summation, the set of N + 1 equations in N + 1 unknown is given by Therefore, we can write Equation ( 5) in matrix form by defining the design matrix A = {α λ,i } i.e., (2M + 1) × (N + 1) for the polynomial approximation.The transpose of A as A T = {α i,λ } and the product matrix B = A T A as (N + 1) × (N + 1)symmetric matrix.Then, polynomial coefficient vector is given by f = [ f 0 , f 1 , . . ., f k ] T and input samples vector by x = [x λ−m , . . ., x λ−m , x λ , x λ+m , . . ., x λ+m ] T where, x λ = x = 0. Hence the desired matrix form of normal equation is expressed as Bf = A T Af = A T x and solution for the polynomial coefficient is expressed as f = A T A −1 A T x = Hx, where H matrix is independent of the input samples (it depends only on M and N).Therefore, the output sample can be computed by the convolution equation where h −m = h 0,m = p(n) − M ≤ m ≤ M, and h i,n is the element of the (N + 1) × (2M + 1) matrix H and h 0,m is the element of the 0 th row.

Adaptive Savitizky-Golay Filtering
In this subsection, we describe our proposed Adaptive Savitizky-Golay filter as an improvement of the existing classical Savitizky-Golay filter.Classical Savitizky-Golay filtering is often used to separate the noisy peak in a given peak distribution pattern with assumption that only the corrupted peaks are available and the aim to identify those peaks.First we consider window length (M) and degree of the polynomial (k) to be arbitrary.Thus, we express the coordinates of the local minimum and maximum points of the initial smoothing in the following order We introduce d to be the distance vector containing the number of samples between two neighboring points of local minima and maxima to be Let S = [s 1 , . . ., s 1 ] be the number of peak with same amplitude in a given segment of peak distribution.Thus, variance neighboring peak points between the δ local extrema are determined step-wise by equation The actual variance calculated above based on the previous values >> ε 1 .This forms the first segment of the next part; however, the window must match the spread peak distribution while the polynomial degree must vary depending on the frame-size and peak distribution.Thus, each segment consisting of peaks with similar peak height (amplitude) we applied window length (M) and polynomial degree (k) based on fuzzy relation given by equation where dS is the average length of the segments in the current S parts of the peak distribution, while δ max = max(d) is the observed peak.If f d max , dS = 1 is the peak with highest amplitude in that particular segment.Hence, once we have the coordinates of the local minimum and maximum peak points and the vector d, we assign the k and M values to each S segment using some fuzzy rules, then we apply multi-round linear approximation method according previous work [31] for parameter update.The purpose is to identify the imprecision or inflexion points after the first adaptive Savitizky-Golay smoothing; thus, correction processes enable the introduction of new cutting points for the next adaptive smoothing.

Feature Extraction
To extract feature statistics from the filtered peak we denote peak segment by S; thus, S = {s 1 , s 2 , s 3 , . . . ,s n }, where n denotes the total number of segments obtained.Hence, based on set of S, we extract feature statistic for each segment of CNVs by calculating the local density (ρ) and minimum distance (δ) to obtain the corresponding values of the neighboring peaks in each segments.With the consideration of that regions with changed copy numbers are inherently different from those of normal copy numbers and only account for a small part of the whole genome, we transfer the problem of detecting CNVs to the issue of identifying outliers from the set of segments with features of (ρ) and (δ).Accordingly, each segment can be regarded as an object or a point in the two dimensional space of (ρ) and (δ).In the following text, we provide a detailed description to these two features and the calculation approach.Before we describe the two features (ρ) and (δ), we introduce the Euclidean distance between two segment s i and s j .Given two segment s i and s j with equal length l, we can obtain an Euclidean distance matrix M l×l to measure the distance between each element (d ij ) using the Euclidean distance formula where ρ i and δ i are the feature values of a given genomic segment s i and s j , same apply to ρ j and δ j .Again, using distance matrix M l×l , we calculate the number peaks adjacent to the peaks in segment s i by equation where χ(x) = 1 if x < 0, otherwise χ(x) = 0, ρ i is the local density, γ is adjustable distance threshold.Next, we calculate the minimum distance between the peaks with higher density values in segment s i to rest of peaks by equation where δ i is the minimum distance defined as the minimum value among the distances between the peaks in segment s i and those peaks with higher density than segment s i .Similarly, we can calculate the maximum distance between the peak with highest density in segment S by equation where δ i is the maximum distance defined as the maximum distance between the peak in segment s i and the rest of peaks in the set S. Lastly, if we assume the smoothed peaks have normal distribution, using multivariate Gaussian distribution function [32], we can extract the feature statistic for each segment using equation where µ is the two-dimensional vector corresponding to the mean values of local density and minimum distance, i.e., µ = [ρ, δ], and K is the covariance matrix of the two features.

Effect of Window Length on Smoothing Performance
We evaluate the smoothing performance of both classical Savitzky-Golay filter Section 2.2 and adaptive Savitzky-Golay filter Section 2.3 using sharpening concept, a technique based on standard window function convolution [33].First we consider the corrupted peak f (S i ) by a noise with σ = 1.We then show the effect of standard window function convolution on overlapping noisy peaks filtered by classical Savitzky-Golay and adaptive Savitzky-Golay at different polynomial order k and window lengths as shown in Figure 1 and Figure 2, respectively.From the results we observe that at short window length (m 1 = 53), classical Savitzky-Golay has a low bias and high variance in compared larger window length (i.e., m 3 = 153, m 2 = 201 and m 1 = 253).On the other hand, adaptive Savitzky-Golay shows low bias and high variance in larger window length (see red solid line in Figure 1).According the analysis, for classical Savitzky-Golay filter an increase in window length leads to an increase in smoothing bias and decrease in variance.However, adaptive Savitzky-Golay an increase in window length have minimal or no effect on smoothing bias and variance (see red solid line in Figure 2); this is because the method automatically selects the polynomial order (k) and window length M based on the peak distribution, allowing peaks with a high rate of change to be smoothed accurately.
According Figure 3, the outcome of the subsequent adaptive SG-smoothing.It is clear that the soft cambers are tracked and the shape of the peak distribution is preserved with proper noise component removal.In the case of an asymmetric peak distribution, this iterative method of smoothing and correction performs well.To accomplish this, we first conduct a quick and easy calculation of the coefficients, after which we conduct a highspeed resampling of the peaks to align with the smaller running window.Next, we use straightforward nearest neighbor interpolation to replace the missing values.Furthermore, we investigated the relationship of window function convolution and stopband attenuation for two overlapping noisy peaks based on the calculation of the near-boundary values to find local minima and maxima peaks (i.e., we simply use the polynomial fit over the 2m + 1 neighborhood closest to the boundary).We assume that the noisy peak and the filtered peaks are at or near the boundary with each other and calculate the peak height fidelity based on local minima and maxima values.
The results of identified local minima and maxima values at different window length by the proposed adaptive Savitzky-Golay filterig are shown in Figure 4.The blue stars are the detected local minima and maxima peak values.From these results, we observe that adaptive Savitzky-Golay filter would reduce the peaks to ≈5% of their original height since the algorithm uses a linear approximation of the peaks for precise smoothing.The optimal signal resolution is determined by the local extrema points.The method performs adaptive smoothing and correction iteratively, allowing the shape of fast-varying peaks to be precisely detected.

Evaluation of Filter Order on Smoothing Performance
We evaluate the effect of filter order on smoothing performance by calculation of the minimum mean squared error (MMSE) using equation where r i is noise coefficient, σ is the noise power, f (S i ) is the peak distribution function and k is the filter order.In our simulation, the original peaks were first corrupted the by Gaussian noise with zero mean and two noise power values, i.e., we introduce low noise power (σ = 0.05) and high noise power (σ = 1) at different window to measure minimum MSE.The goal was to check the effect of filter order on estimation error.Simulation results in Table 1 show the effect of different filter order on the estimation error.When we perform the adaptive multi-round filter at different polynomial orders, we observed a relatively high MMSE at low filter order (k 1 = 2) and low MMSE at higher filter order.This implies that MMSE is dependent on polynomial order as result leads to computational burden due to least square (LS) fitting.Since the proposed Adaptive Savitzky-Golay filters select the polynomial order automatically, the peaks with a high rate of change are properly smoothed, thus the computational burden associated with higher polynomial order and window length is reduced.
Table 1.Effect of filter order on the estimation error.

Comparison of Adaptive Savitzky-Golay Filtering with Peer Methods
We compared the adaptive Savitzky-Golay filters performance with the other peer filtering methods including: Fourier transform [15], Gaussian Kernel filtering [16], Epanechnikov Kernel [17], Lowess [18] smoothing algorithms.In analysis, we use moving average optimal window length approach to compare the smoothing performance.We first corrupted the peaks by adding noise power (σ 2 ) to the original peaks.The noise power, i.e., the ratio between the output and input root-mean-square noise, were calculated for white noise.Figure 5A, shows the comparison results of noise suppression.We can observe a non-linear relationship between the noise power and window length, that is an increase in noise power leads to an increase in window length this implies that adaptive Savitzky-Golay have better performance in optimal window length estimation compared to other peer methods.All smoothing methods produce comparable results, with the adaptive Savitzky-Golay filters outperforming the others in terms of noise power.With increasing window length, Fourier smoothing offers slightly less noise suppression than Lowless, Gaussian, and Epanechnikov Kernel.As a result of the more gradual cutoff in the frequency domain, noise suppression of k 1 = 2 filters is slightly weaker than that of higher degrees k filters.In addition, we compared adaptive Savitzky-Golay peak height fidelity to that of other peer filtering methods.In this case, we measure the fwhm peak with 90% peak height fidelity.We measure the white noise gain and define the noise bandwidth as the integral over the kernel's power spectrum, with the full bandwidth corresponding to the peak function.The gain of the white noise is then proportional to the square root of the bandwidth.Increasing the bandwidth causes less attenuation of a peak with a given full width at half maximum (fwhm); sharper peaks (lower fwhm) require more bandwidth.We can plot the peak height fidelity as a function of the product of the noise bandwidth and the fwhm, which is largely independent of the specific bandwidth or fwhm value.
The merits of the various filters are then shown in Figure 5B.If a specific peak height fidelity (e.g., ≈90% of the original peak height) is required, the curve with the lowest noise bandwidth for white noise-that is, it best suppresses the noise.Convolution with a Gaussian kernel performs the worst, according to the results (except when the peaks are strongly attenuated to less than ≈40% of their original height).The adaptive Savitzky-Golay filters, Lowess, and Fouier filters are nearly equal and best (the difference is less than the line width in), and the Epanechnikov Kernel comes close.The Gaussian kernel filter performs worse than our filters due to poor high-frequency noise attenuation (see Figure 5B).Increasing the bandwidth of Adaptive Savitzky-Golay filtering provides an improve peak height fidelity from ≈63% to ≈90%.In addition, we compare the RMSE of adaptive Savitzky-Golay filters with peer filtering methods.According to Figure 5C, we found that all methods produce similar results; however, adaptive Savitzky-Golay filters recorded a low Root Mean Squared Error (RMSE) due minimal estimation bias (see more additional result in Figure A1, Appendix A.)

Application in CNVs Peak Analysis
Note that our proposed adaptive Savitzky-Golay filtering is a data smoothing and feature extraction tool for CNVs peak profile analysis.Therefore, in our analysis we focus on following aspects: (1) to smooth CNVs and segment the observed RD peak profile, so that adjacent bins with similar peak amplitudes can be merged into the same region and the bins showing a local variation is unmasked; (2) to extract meaningful feature statistic from peak profile data an accurate distinction between mutated and normal genomic regions.
(3) to provide a reasonable model for visualising the extracted features to perform a suitable analysis of the features to determine CNVs.

Data Preparation
In this study, experimental WES data generated for germline mutation analysis using the Illumina DNA exome kit were used.For the experimental simulation data, we generated two data sets.We used the coverage depth profile of individual samples (LONG tag) that was based on all reads of the given sample.Furthermore, we generated a validation data set (SHORT) based on the short reads (length ≤ 80 base pairs) of a large cohort of samples.We use "samtools depth" command with the "-a"/output to identify all positions (including zero depth)/and "-b segments.bed"options to query the read depth of interests from our BAM (binary sequence alignment map files).

Simulation Studies
Simulation studies are usually regarded as an appropriate and feasible way to assess the performance of existing and newly developed methods [34].This is because the ground truth CNVs embedded in the simulated data sets could be used for an exact calculation of sensitivity and precision for the methods.CNV peak distribution are often asymmetrically in nature due to variation in depth coverage attributed to the GC bias during the hybridization process.The CNV peak generation often depends on the read depths coverage of the target region and the oligo capture baits within those regions which results into asymmetrical peak distribution.This poses a great challenge smoothing such peaks; thus, it is important to model a peak function that applied to a wide class of the peak distribution patterns obtained from different genomic segments.With a careful consideration of the problems described above, we filtered all CNV peaks for each genomic segments using the proposed adaptive Savitzky-Golay filtering method.For optimal CNV peak filtering, we apply the concept described in Section 2.3.Using Equation (15), the filter order and optimal window length values are automatically computed by adaptive Savitzky-Golay filters.The feature values of CNV peak in each genomic segment were then extracted using Equations ( 12) and (13).The statistical significance for each peak at given genomic segments was then calculated using Equation (15).According to the results shown in Figure 6, there was significant level of feature extraction for all the genomic segments evaluated for CNV peak (i.e., both long and short tag peaks-also see results in Figures A2 and A3).

Discussion
Although several filtering algorithm exists, most of them require the characterization and model parameter tuning for efficient filtering and smoothing of the peaks or signal data [18,[35][36][37].Savitzky-Golay method was originally developed to make discernible the relative widths and heights of spectral lines.The Savitzky-Golay filters are widely used in many fields of data processing, ranging from spectra in analytical chemistry to geosciences and medicine [38].The SG method was originally developed to make discernible the relative widths and heights of spectral lines.It equally smooths the noise and the signal components, as it leads to bias and a reduction in resolution.For denoising peaks with a large spectral dynamic or with a high rate of change, the classical SG filtering is an unsuitable method [39].In addition, efficiency depends on the appropriate selection of the polynomial order and the window length, which should match the intrinsic scale of the input peaks.However, the SG-filters provide excellent results while preserving simplicity and speed, but most of the applications require the users to arbitrarily select the polynomial order and size of the sliding window.In general, the SG filters perform well when we apply a low-order polynomial with long window length or low degree with short window and repeated smoothing.It has also been shown that the smoothing effect decreases by applying low-order polynomial on higher frequencies or high-order polynomials on lower frequency parts of the peaks.With this in mind, in this study we introduced an adaptive filtering method based on SG filtering and feature extraction algorithm, which provides good performance independent of the type of noise in the CNV peaks.The proposed technique ensures high-precision noise reduction by iterative multi-round smoothing and correction.In each round, the parameters are dynamically changed due to the results of the previous smoothing.Our approach provides additional support for data compression based on optimal resolution of the peak with linear approximation as well as density and Euclidean-distance-based function for feature extraction as an option to preprocessing CNV peaks.The classical SG-filter depends on the appropriate setting of the window length and the polynomial degree, which should match the scale of the signal, since in the case of signals with high rate of change, the performance of the filter may be limited [19].Here, simulation results validate the applicability of our approach in the analysis of CNV peak in whole exome sequencing data.Meanwhile, we demonstrated the difference and the robustness of adaptive Savitzky-Golay and its application with other peer filter methods (Fourier [35], Epanechnikov Kernel [36], Gaussian Kernel [37] and Lowess [18]).
Our findings demonstrate the relationship between the adaptive Savitzky-Golay window lengths and filter order (polynomial degree).We discovered that noise suppression is effective when the filter order and window length are set to optimal values.We noted that it is critical to compute the derivative and set an optimal window length before applying the adaptive Savitzky-Golay algorithm to noisy peaks.We have seen, for example, that there is high estimation bias with weak smoothing (e.g., to preserve peak patterns a window length of m 2 = 201 and a filter order of k 2 = 3 is necessary).For strong smoothing, however, the difference between differentiation first and smoothing first grows.As a result of insufficient noise suppression near the boundaries, the "derivative first, then smoothing" is ineffective for strong smoothing.In a broad sense, this shows that adaptive Savitzky-Golay is more effective at noise suppression when the window size is small and the filter order is high (see Figure 1).We also noticed that when comparing different smoothing methods, the trade-off between window size and noise suppression does not always apply.The Gaussian Kernel, on the other hand, has the worst noise suppression at the boundaries in all cases, and its filtered peaks exhibit more estimation bias.As a result, other methods perform better when smoothing near-boundary data in terms of artifacts and noise suppression (see Figure 2).In addition, we found a non-linear relationship between noise power, RMSE, and window size (see Figure 3).Figure 3A shows that Savitzky-Golay has a higher noise power than other peer filtering methods.However, we discovered that adaptive Savitzky-Golayhas has a lower RMSE and performs better.In general, the Fourier, Epanechnikov Kernel, and Gaussian Kernel filters performed noticeably worse in terms of estimation error (RMSE) when compared to the Lowess filtering method, which provided slightly better noise suppression though not effective than adaptive Savitzky-Golay.
Using our proposed peak detection and analysis, we demonstrated the functionality of the adaptive Savitzky-Golay filter for peak smoothing and density-based peak detection methods for the statistical feature extraction of smoothed peaks in each genomic segments.According to the results, we observed that, after filtering WES, we obtained the short-and long-tag CNV peaks in all genomic segments with roughly the same count position when filtering at optimal window size (m 1 = 51) and filter order (k 4 = 51).We also noted that adaptive Savitzky-Golay completely suppressed noise at the boundaries of short-and long-tag CNV peaks (see Figure 4).We also discovered that increasing the window size at the fix filter order results in low suppression noise at the boundaries of short-and long-tag CNV peaks, resulting in high peak height fidelity due to under-fitting (high estimation bias).Furthermore, visual inspection reveals that noise suppression at high filter order is nearly as good as convolution with small window length (see Figure A1).Generally, the average read size (220-270 base pair) was larger than the 80bp oligo capture bait size and also that any oligo capture bait could be designed in any position, including partially overlapping positions or gaps less than the peak width we used two approaches to identify the sub peaks corresponding to the designed oligo capture baits in the genomic segments.As the readily available approach, first we used all reads of the samples and for the individual coverage profiles of each samples we tried to infer the position of the sub peaks.In the second approach, we used the fact that average fragment size in a sample is a distribution of randomly broken DNA fragments that includes a small fraction (usually less than 1-2% of all reads in our samples) of reads with comparable length of the oligo capture baits (80 base pairs in the used kit).In cases where we have a high number of samples, we can filter all short reads with comparable size with the oligo capture baits, as this would be a much better approach to identify the exact genome positions of the design.However, since such sort reads are usually only 1-2% of the whole sequence in practice, 80-100 samples are not always available.Therefore, we wanted to explore whether the bulk of the data could be used to infer these positions.As we mentioned in Section 3.4.1,we used experimental WES data using the Illumina DNA exome kit.However, from the experimental data, we generated two data sets.We used the coverage depth profile of individual samples (LONG tag) that was based on all reads of the given sample.Furthermore, we generated a validation data set (SHORT) based on the short reads (length ≈ 80 bp) of a large cohort of samples.
Furthermore, we discovered that there is an optimal DNA fragment size in each hybridization based on the WES kit that produces the maximum coverage result.This fragment size is higher than the probe size (≈80 bp) since it is optimized for Paired-End Sequencing, so the kit-specific 100 or 150 base pair Paired-End Sequencing does not overlap too much.This suggests that the bulk of the reads are less than or equal to the probe size, and it is unclear which region of the fragment was precisely hybridizing to the oligo capture bait, leading to more spread and superimposed peaks (see Figures A1 and A2).Surprisingly, there were statistically some smaller DNA fragments in one sample, but in such low amounts that most of our hybridisation oligo captured no DNA fragments.However, if we collected all the short reads from many samples, we could use these data to find the individual subpeaks that are superimposed on each other when two oligo capture baits are closer to each other than the average DNA fragment size (minus the bait size).As a result, we used a large cohort of data to locate the exact sub peak positions using short reads.These data can be used to assess the performance of the CNV peak detection algorithms by evaluating whether the same peak centers can be detected from a single sample.
From a clinical point of view, targeted resequencing is used to sequence a smaller portion of the human genome.Usually, these are the coding regions of genes (WES) or just a smaller thematic gene panel, but the region of interest could also be UTR regions, non-coding RNA, deep intronic varoations associated with disease, etc.Targeted RefSeq is cost effective, as far fewer sequences (reads) are needed to achieve higher genome coverage.However, since the target region is not randomly covered by reads like at WGS but depends on the oligo capture baits, detecting the genome coverage peaks is important to perform CNV analysis in targeted reseq.data.

Conclusions
In this study, we proposed adaptive Savitzky-Golay filtering.In order to effectively smooth peaks with high amplitudes, our proposed technique automatically chooses the polynomial order and window length based on the peak distribution.The algorithm uses a linear approximation of the peaks for accurate and consistent smoothing.The local extrema points serve as the foundation for the optimum peak smoothing.The method applies adaptive smoothing and correction iteratively, making it possible to detect the shape even of rapidly varying peaks.In addition to the complete removal of the noise components, the significant peak features are, however, preserved, irrespective of the nature of the noise process.Furthermore, a decomposition of the peaks in this sense with linear approximation enables practical data compression.The results of the simulations have demonstrated that the proposed technique enables excellent performance.Our approach outperforms other peer smoothing techniques and is a definite improvement of existing approaches.The application of read depth (RD) profiles of WES data to real-world experimental data showed low peak height fidelity (low estimation bias) and the significant detection of short-and long-tag CNV peaks in all genomic segments.Therefore, we demonstrated the efficiency of the adaptive Savitzky-Golay filtering method and its applications in CNVs peak detection, which could be useful as a supplement to existing methods in the field of CNV analysis.

Appendix B
Figures A2 and A3 is additional comparison results CNVs peak detection at using Adaptive Savitzky-Golay filtering before and after correction (i.e., adjustment of polynomial order and window length).Both Figures A2 and A3, the right hand side show smoothing before Adaptive Savitzky-Golay correction and Left hand side show smoothing after Adaptive Savitzky-Golay correction.For each genomic segment, we use Equation ( 9) described in Section 2.3 to calculate the statistical significant for segment.The aim is to demonstrate the smoothing performance of adaptive Savitzky-Golay filtering before and after correction by linear approximation.

Figure 5 .
Figure 5.Comparison of performance of Adaptive Savitzky-Golay filtering with peer filtering methods.(A): Comparison based on noise power; (B): Comparison based on peak height fidelity; (C): Comparison based on smoothing bias.

Figure 6 .
Figure 6.Show CNV peaks from different genomic segment filtered using Adaptive Savitzky-Golay filters.Green and red solid lines are the original short-and long-tag CNV peaks , respectively; cyan and blue solid lines are the smoothed the short-and long-tag CNV peaks, respectively.

Figure A2 .
Figure A2.Show Adaptive Savitzky-Golay filtering of CNV peaks for segment 1 to 5 before and after correction.Green and red solid line are the original short and long tag CNV peaks , respectively; cyan and blue solid lines are the smoothed the short and long tags CNV peaks, respectively.

Figure A3 .
Figure A3.Show Adaptive Savitzky-Golay filtering of CNV peaks for segment 6 to 10 before and after correction.Green and red solid line are the original short and long tag CNV peaks , respectively; cyan and blue solid lines are the smoothed the short and long tags CNV peaks, respectively.