Copy Number Studies in Noisy Samples

System noise was analyzed in 77 Affymetrix 6.0 samples from a previous clinical study of copy number variation (CNV). Twenty-three samples were classified as eligible for CNV detection, 29 samples as ineligible and 25 were classified as being of intermediate quality. New software (“noise-free-cnv”) was developed to visualize the data and reduce system noise. Fresh DNA preparations were more likely to yield eligible samples (p < 0.001). Eligible samples had higher rates of successfully genotyped SNPs (p < 0.001) and lower variance of signal intensities (p < 0.001), yielded fewer CNV findings after Birdview analysis (p < 0.001), and showed a tendency to yield fewer PennCNV calls (p = 0.053). The noise-free-cnv software visualized trend patterns of noise in the signal intensities across the ordered SNPs, including a wave pattern of noise, being co-linear with the banding pattern of metaphase chromosomes, as well as system deviations of individual probe sets (per-SNP noise). Wave noise and per-SNP noise occurred independently and could be separately removed from the samples. We recommend a two-step procedure of CNV validation, including noise reduction and visual inspection of all CNV calls, prior to molecular validation of a selected number of putative CNVs.


Introduction
Genomic copy number variation (CNV) has been associated with a variety of clinical phenotypes [1][2][3][4][5][6]. Hence, the study of CNV is of diagnostic importance. CNV identification from high-density SNP microarrays may be unreliable, particularly in noisy data [7][8][9]. Therefore, extensive validation of CNV findings is needed. Since CNV detection software may identify hundreds of putative CNVs in each sample, and since validation of CNV findings by qPCR or by other molecular methods is laborious, we searched for simple strategies to evaluate large numbers of CNV findings.
Rigorous studies revealed that several components of system error occur in copy number data [10][11][12][13]. Here we focus on two major types of noise and present the noise-free-cnv software package for the visualization of copy number data and for the reduction of noise. This software enables large-scale inspection of CNV findings (produced by PennCNV [14], Birdview [15,16], or other specialized software packages). For illustration, we used 77 microarrays from a previous study of patients with cervical artery dissection from Switzerland and Southern Germany (age: 42.5 ± 9.8 years; 31 (40.3%) women) [17]. DNA was isolated from peripheral blood samples (no DNA from lymphoblastoid cell lines was used). DNA extraction, array hybridization, and array scanning were performed according to the manufacturer's instructions [17]. The Log R Ratio (LRR) and B-allele frequency (BAF) values were obtained from the CEL files with the Affymetrix Power Tools software (APT), which was also used for quantile normalization. The LRR and BAF values can then be imported into PennCNV, into other CNV detection software packages (QuantiSNP, MAD), or into noise-free-cnv.
The Affymetrix 6.0 microarrays used for CNV detection contain a total of 906,600 single nucleotide polymorphisms (SNPs) and 946,000 non-polymorphic copy number probes (CNPs) covering all human chromosomes. In the present article, the term SNP is used for all analyzed probe sets (SNPs as well as CNPs). Figure 1 shows two samples (visualized by noise-free-cnv), displaying signal intensity (LRR, upper panel) and B-allele frequency (BAF, lower panel) of all SNPs ordered along the chromosomes. The Log R Ratio (LRR) is a normalized measure of the total signal intensity for the two alleles of a SNP. The B-Allele Frequency (BAF) is a normalized measure of the allelic intensity ratio of the two alleles [18]. Signal intensities in sample ID 2355 show larger variance than in ID 1022. Moreover, a prominent pattern of waves is apparent in sample ID 2355, and we observed similar wave patterns in many samples. The noise-free-cnv software identifies waves using a Gaussian filter with a large standard deviation, for instance comprising 1,000 SNPs. This filter "blurs" the values as shown in Figure 2(G,H). We called the resulting smoothed data the wave component of the LRR values. The variance of the blurred LRR values, the wave variance, is a measure of the prominence of waves.

[Figure 2 caption, beginning missing] ... were compared with signal intensities of SNPs of chromosome 7 of two patients (C,D) and with a human prometaphase (E) and metaphase (F) chromosome 7. Signal intensities shown in C and D were smoothed (noise-free-cnv software, function "blur" across 1,000 probe sets) to visualize genomic waves (G,H).
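The Gaussian smoothing used to isolate the wave component can be illustrated with a minimal sketch. It is not the noise-free-cnv implementation (which is written in C++ and uses the FFTW library): the direct convolution, the truncation of the kernel at three standard deviations, the renormalization at the ends of the series, and the function names gaussian_blur and wave_variance are all assumptions made for illustration.

```python
import math
from statistics import pvariance


def gaussian_blur(values, sigma):
    """Smooth a series with a Gaussian kernel of standard deviation sigma (in SNPs)."""
    radius = int(3 * sigma)  # assumption: truncate the kernel at 3 sigma
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    n = len(values)
    blurred = []
    for i in range(n):
        total = 0.0
        weight = 0.0
        for k in range(-radius, radius + 1):
            j = i + k
            if 0 <= j < n:  # assumption: renormalize weights near the ends
                w = kernel[k + radius]
                total += w * values[j]
                weight += w
        blurred.append(total / weight)
    return blurred


def wave_variance(lrr, sigma):
    """Variance of the blurred LRR values (the wave variance)."""
    return pvariance(gaussian_blur(lrr, sigma))


# Toy data: a slow wave plus rapidly alternating per-probe deviations.
lrr = [0.2 * math.sin(2 * math.pi * i / 200) + (0.3 if i % 2 else -0.3)
       for i in range(400)]
wave = gaussian_blur(lrr, sigma=10)
print("total variance:", pvariance(lrr))
print("wave variance: ", wave_variance(lrr, sigma=10))
```

A direct convolution of this kind is far too slow in pure Python for roughly 1.8 million probe sets and a standard deviation of 1,000 SNPs; it is meant only to make the definition of the wave component concrete.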

Noise Components
This wave pattern was compared with the banding pattern of metaphase chromosomes (Figure 2). Human metaphase chromosomes were stained with the Giemsa-trypsin procedure, which induces a banding pattern. AT-rich regions are more frequent in Giemsa-dark bands than in Giemsa-light bands [19,20]. In our study samples, Giemsa-dark bands corresponded to genomic regions with reduced probe set signals. This pattern of noise was described by others as "genomic waves" or "CG-waves" [10][11][12][13]. The co-linearity of genomic waves with Giemsa bands illustrates that genomic waves follow a similar pattern in all samples.
After subtraction of the wave component, the resulting LRR values follow an approximately normal distribution around zero. We called the resulting values the per-SNP component and their variance the per-SNP variance. The decomposition of system noise into a wave component and a per-SNP component is shown for one sample in Figure 3. Wave variance and per-SNP variance are listed for all samples in Table A1. The system deviations of individual SNP signal intensities are strongly correlated across samples (Figure 4). To quantify the correlation of the noise components between different samples, we computed two additional data series: for each SNP, the median across all 77 per-SNP components was computed and saved as the per-SNP profile; the wave profile was obtained by applying the same procedure to the wave components. We then computed, for each sample, the correlation between the wave profile and the (individual) wave component as well as the correlation between the per-SNP profile and the (individual) per-SNP component. Details of the algorithm are described in the Appendix. The high correlations found in our 77 samples confirmed that wave noise and per-SNP noise are system noise, i.e., follow highly non-random patterns. On average, the correlation was 0.843 for the wave component and 0.568 for the per-SNP component.
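The profile computation and the per-sample correlations described above can be sketched as follows. The sketch assumes that each component is available as a list of values in the same SNP order for every sample and that the correlation is a Pearson correlation; the function names and the toy values are hypothetical.

```python
import math
from statistics import mean, median


def median_profile(components):
    """Per-SNP median across samples; components is a list of equal-length lists."""
    return [median(column) for column in zip(*components)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def profile_correlations(components):
    """Correlation of each sample's component with the batch median profile."""
    profile = median_profile(components)
    return [pearson(component, profile) for component in components]


# Toy per-SNP components of three samples over six probe sets (made-up values).
per_snp_components = [
    [0.10, -0.20, 0.30, -0.10, 0.00, 0.20],
    [0.21, -0.39, 0.61, -0.19, 0.01, 0.41],
    [0.30, 0.10, -0.20, 0.00, 0.10, -0.10],
]
print(profile_correlations(per_snp_components))
```

Averaging the returned values over all samples gives a summary comparable in kind to the mean correlations reported above, computed once for the wave components and once for the per-SNP components.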

Factors Associated with Quality of Copy Number Data
The resolution of a classical chromosome study depends on the quality of the chromosomes and is expressed as the total number of visible cytogenetic bands (400 bands: low to moderate quality; 850 bands: excellent quality). To our knowledge, no comparable quality metric for molecular karyotyping exists. Quality control in most copy number studies consists of rejecting samples with outlier numbers of CNV findings. A quality metric for the resolution of a CNV study (relating the size of a CNV and the likelihood of its detection) has not yet been defined. In the current study we propose a preliminary quality metric based on the median number of SNPs per chromosome with copy number state (CN) ≠ 2 (numbers per chromosome for all cases are shown in Table A1). The copy number state of each SNP was determined by the Affymetrix Power Tools software package (APT). SNPs located in common CNVs were excluded from this analysis. To identify SNPs located in common CNVs, we analyzed 403 control samples without visible waves and with the highest genotype call rates, selected from a large German population (PopGen [21]), as described before [17]. The median number of SNPs with CN ≠ 2 per chromosome was considered as a preliminary quality metric. The quality of a sample was related to the chromosomal background of SNPs with abnormal copy number (Figure 5). We defined deliberate quality categories: samples were classified as eligible if the median number of SNPs per chromosome with CN ≠ 2 was zero; those with >100 SNPs with CN ≠ 2 were classified as ineligible. Samples were classified according to the defined quality categories in Table 1. The use of freshly prepared DNA (compared to DNA samples that had been in use for years and had been thawed and frozen repeatedly) was a significant determinant of eligible samples (p < 0.001).
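The quality categories can be expressed as a short sketch. It assumes that the threshold of 100 applies to the same per-chromosome median as the eligibility criterion and that every sample meeting neither criterion is of intermediate quality; the function name and the input layout (one count per chromosome, common CNVs already excluded) are hypothetical.

```python
from statistics import median


def classify_sample(abnormal_snps_per_chromosome):
    """Classify a sample from its per-chromosome counts of SNPs with CN != 2."""
    m = median(abnormal_snps_per_chromosome)
    if m == 0:
        return "eligible"
    if m > 100:  # assumption: the threshold refers to the per-chromosome median
        return "ineligible"
    return "intermediate"  # assumption: everything between the two thresholds


# Toy counts for 22 autosomes (made-up values). A single chromosome with many
# abnormal SNPs, for example one carrying a rare CNV, leaves the median at zero.
print(classify_sample([0] * 21 + [250]))
print(classify_sample([40] * 22))
print(classify_sample([180] * 22))
```

Using the median rather than the total makes the category robust to a true, localized CNV on one chromosome, which would otherwise penalize an otherwise clean sample.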
Samples with high call rate (rate of successfully genotyped SNPs) were more likely to be suitable for copy number studies than those with lower call rates (p < 0.001). Low levels of wave variance as well as per-SNP variance were associated with eligibility for CNV analysis (p < 0.001). Eligibility for CNV studies was not significantly associated with the median number of calls by PennCNV (p = 0.053). However, eligible samples had between 63 and 165 calls, while the range of calls was much broader in ineligible samples. Birdview yielded significantly more calls in ineligible samples (p < 0.001). The proportion of putative false positive Birdview calls increased with decreasing confidence rates: The number of CNV findings with confidence below 2.5 was most strongly elevated.

Figure 6 summarizes salient aspects of system noise in SNP microarrays. Figure 6(A) plots for each sample the variances of wave component and per-SNP component. Wave variance and per-SNP variance seem to occur independently from each other: the observed correlation between both noise components (r = 0.124) was not significant (p = 0.401). Figure 6(B) illustrates the relation between sample eligibility and noise components in the eligible (n = 23) and ineligible (n = 29) cases. Eligible samples (i.e., those that are supposed to be excellent for copy number studies) have low levels of per-SNP variance. Samples with high wave variance are inappropriate for copy number studies.

Noise Reduction in Copy Number Samples
The noise-free-cnv software package permits the visualization of samples, the isolation of noise components and the subtraction of isolated noise components. The next two examples (Figures 7 and 8) illustrate noise reduction by comparing a test sample with a reference sample. We finally demonstrate the use of the noise-free-cnv-filter algorithm for the evaluation of CNVs.

[...] (Table A1; Figure 5). Neither PennCNV nor Birdsuite identified a large CNV on chromosome 5. After noise reduction, LRR and BAF values were suggestive of the presence of a mosaic deletion [22][23][24] (Figure 8(B,D)). To confirm the diagnosis of a mosaic deletion, a conventional chromosome analysis was performed: Some rare 5q chromosomes were observed amongst a majority of normal chromosome sets. Interestingly, it was recently demonstrated that the identification of mosaic abnormalities by microarray analysis is unreliable [25].
We developed the noise-free-cnv-filter algorithm for optimized noise reduction (Appendix). In the samples of our study population, noise-free-cnv-filter analysis resulted in an average reduction of the wave variance by 74.2%, of per-SNP variance by 35.3% and of the overall variance by 38.1%. Noise-reduction according to this algorithm supports the evaluation of CNV findings, in particular when the putative CNVs are small (Figure 9).
In patient ID 715, both Birdview and PennCNV identified a deletion on chromosome 18 (green bar in Figure 9). Noise-free-cnv-filter analysis of the sample (ID 715 nf) suggested that the deletion was true. Subsequent molecular analysis confirmed the finding: the joining segment of the deletion was identified by a case-specific PCR and the breakpoints of the deletion were identified by DNA sequencing following standard procedures [17,26]. Two putative duplications in patient ID 412 were evaluated after noise-free-cnv-filter analysis. We considered the duplication in chromosome 1 (region 222 Mb) as spurious (red bar), but the duplication in chromosome 9 as probably true. As a consequence, the putative duplication in chromosome 9 is a candidate for further validation by molecular methods.

[Figure 9 caption, beginning missing] ... show LRR values after noise-free-cnv-filter analysis. Samples were renamed with suffix "nf" after noise-free-cnv-filter analysis. Bars indicate putative CNV findings.

Conclusions: Proposal of a Two-Step Procedure for the Validation of CNV Findings
Our analysis had the following key findings: (1) copy number samples may be noisy, which, above a certain level of noise, interferes with reliable identification of CNVs; (2) eligible copy number samples were more likely when fresh DNA was used for microarray hybridization; (3) the wave component and the per-SNP component of noise are independent; (4) noise-free-cnv software enables noise reduction by subtracting wave and per-SNP noise components from samples; and (5) noise-free-cnv software supports the quality control of copy number data and the validation of copy number findings.
The current noise-free-cnv version was developed for the analysis of SNP microarray samples and was not designed for noise reduction in array based comparative genomic hybridization samples. The present study highlighted the value of noise reduction for large scale CNV validation (after software-assisted CNV detection). However, the value of noise reduction before software-assisted CNV detection is to be analyzed in future studies.
Based on our analysis of noise in real-life copy number samples, we suggest a two-step procedure of CNV validation. As a first step of preliminary CNV validation, we propose large-scale inspection of CNV findings after noise reduction, to select putative candidate CNVs and reject false positive findings. In a second step, this selection of putative CNV calls is analyzed further by independent molecular methods for final validation [17,26].

A1. Noise-Free-CNV
The noise-free-cnv program package was specifically developed to analyze copy number variation in SNP-microarray samples and to manipulate the data in order to reduce noise. It was written in C++ and released as free software under the GNU General Public License version 3. Installer packages are available for Debian-based Linux systems and Windows. For the computation of the Fast Fourier Transform, we used the FFTW library [27]. Noise-free-cnv is compatible with the file format used by PennCNV [14].
The central program of the noise-free-cnv package is noise-free-cnv-gtk, a visual editor for interactive visualization and manipulation of SNP microarray data. Besides functioning as a browser for direct inspection and verification of CNV findings, it allows the user to perform many operations on the data. These include the Gaussian filters and variance computation referred to in the article. For further information, see the project homepage http://noise-free-cnv.sourceforge.net. A second program, noise-free-cnv-filter, implements a specific algorithm for system noise reduction, as described below. It is usable as a command line program to be easily applied to a batch of samples.

A2. The Noise-Free-CNV-Filter Algorithm
The noise reduction algorithm noise-free-cnv-filter consists of two main steps. In the first step, a genomic wave profile and a per-SNP noise profile are deduced from a batch of samples. In the second step, these profiles are used to modify the individual samples.

A2.1. System Noise Assessment
For each individual sample: (1) The non-autosomal data is removed and the Log R Ratio values are normalized towards an average value of zero. Subsequently, the batch-specific wave is computed by regarding each SNP throughout the wave components of all samples and taking the median value. The same is done for the per-SNP profile utilizing the per-SNP components.

A2.2. System Noise Removal
In the second step, we use the median profiles to adjust the original samples. For each individual: (1) The covariance of the wave component and the batch-specific wave profile is divided by the variance of the wave profile. (2) The result is used as a scaling factor for the wave profile; the scaled profile is then subtracted from the wave component. The same procedure is repeated on the per-SNP components. (3) Finally, the corrected components are added together and yield the corrected Log R Ratio values.
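The removal step can be written compactly. The scaling factor, covariance divided by the variance of the profile, is the least-squares slope of a sample's component regressed on the profile, so the corrected component is uncorrelated with the profile. The sketch below follows steps (1) to (3); the function names and toy values are hypothetical, and the components are assumed to be lists in matching SNP order.

```python
from statistics import mean


def scaling_factor(component, profile):
    """Covariance of component and profile divided by the variance of the profile."""
    mc, mp = mean(component), mean(profile)
    cov = sum((c - mc) * (p - mp) for c, p in zip(component, profile))
    var = sum((p - mp) ** 2 for p in profile)
    return cov / var  # the common 1/n factors of covariance and variance cancel


def remove_profile(component, profile):
    """Subtract the scaled batch profile from one sample's component."""
    k = scaling_factor(component, profile)
    return [c - k * p for c, p in zip(component, profile)]


def corrected_lrr(wave, per_snp, wave_profile, per_snp_profile):
    """Add the two corrected components to obtain corrected LRR values."""
    clean_wave = remove_profile(wave, wave_profile)
    clean_per_snp = remove_profile(per_snp, per_snp_profile)
    return [w + s for w, s in zip(clean_wave, clean_per_snp)]


# Toy check: a component that is exactly twice the profile is removed entirely.
profile = [1.0, -1.0, 2.0, -2.0]
component = [2.0 * p for p in profile]
print(scaling_factor(component, profile))
print(remove_profile(component, profile))
```

Scaling the profile per sample, rather than subtracting it unchanged, lets the correction adapt to samples in which the shared noise pattern is weaker or stronger than in the batch median.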

A3. Program Usage
Noise-free-cnv-filter was implemented as a command-line program. In the simplest case, it receives the file names of several SNP microarray samples in the PennCNV file format (due to the nature of the algorithm, application on a single sample is pointless). It then computes the profiles (saved as "wave_profile" and "per-snp_profile") and the cleaned versions of all provided samples, which it saves as "<original filename>.nf". As additional options, noise-free-cnv-filter allows the use of pre-computed profile sequences and the inclusion of the sex chromosomes into the analysis. As an example, noise-free-cnv-filter-verbose individuals/* applies the algorithm to all files in the directory individuals, discards the sex chromosomes and outputs detailed information about the progress and statistical information about the samples. For further help, type: noise-free-cnv-filter-help.

Table A1. For each sample, genotype call rate, variance, wave variance and per-SNP variance were calculated. The remaining columns show for each chromosome (chromosome number indicated) the number of probe sets with CN ≠ 2. This analysis included only probe sets that had normal copy number (CN = 2) in 403 samples from a population-based German study (for details and references see [17]). A non-random distribution of probe sets with CN ≠ 2 is highly suggestive of the existence of a rare CNV (for instance ID 1147 or ID 653, in contrast to ID 1042 or ID 2034). Even in samples with high variance, non-random distribution can be detected (chromosome 5 of ID D_062, chromosomes 1 and 2 in ID 442).