Please note that, as of 18 July 2017, Microarrays has been renamed to High-Throughput and is now published here.
Article

Copy Number Studies in Noisy Samples

1 Neurology Department, University of Heidelberg, INF 400, Heidelberg D69120, Germany
2 Division of Molecular Genetic Epidemiology, German Cancer Research Center, INF 280, Heidelberg D69120, Germany
3 Stroke Unit and Department of Neurology, University Hospital Basel, Petersgraben 4, Basel CH4031, Switzerland
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Microarrays 2013, 2(4), 284-303; https://doi.org/10.3390/microarrays2040284
Submission received: 20 September 2013 / Revised: 24 October 2013 / Accepted: 25 October 2013 / Published: 6 November 2013

Abstract

System noise was analyzed in 77 Affymetrix 6.0 samples from a previous clinical study of copy number variation (CNV). Twenty-three samples were classified as eligible for CNV detection, 29 samples as ineligible and 25 were classified as being of intermediate quality. New software (“noise-free-cnv”) was developed to visualize the data and reduce system noise. Fresh DNA preparations were more likely to yield eligible samples (p < 0.001). Eligible samples had higher rates of successfully genotyped SNPs (p < 0.001) and lower variance of signal intensities (p < 0.001), yielded fewer CNV findings after Birdview analysis (p < 0.001), and showed a tendency to yield fewer PennCNV calls (p = 0.053). The noise-free-cnv software visualized trend patterns of noise in the signal intensities across the ordered SNPs, including a wave pattern of noise, being co-linear with the banding pattern of metaphase chromosomes, as well as system deviations of individual probe sets (per-SNP noise). Wave noise and per-SNP noise occurred independently and could be separately removed from the samples. We recommend a two-step procedure of CNV validation, including noise reduction and visual inspection of all CNV calls, prior to molecular validation of a selected number of putative CNVs.

1. Introduction

Genomic copy number variation (CNV) has been associated with a variety of clinical phenotypes [1,2,3,4,5,6]. Hence, the study of CNV is of diagnostic importance. CNV identification from high-density SNP microarrays may be unreliable, particularly in noisy data [7,8,9]. Therefore, extensive validation of CNV findings is needed. Since CNV detection software may identify hundreds of putative CNVs in each sample and since validation of CNV findings by qPCR, or by other molecular methods, is laborious, we searched for simple strategies to evaluate large numbers of CNV findings.
Rigorous studies revealed that several components of system error occur in copy number data [10,11,12,13]. Here we focus on two major types of noise and present the noise-free-cnv software package for the visualization of copy number data and for the reduction of noise. This software enables large-scale inspection of CNV findings (produced by PennCNV [14], Birdview [15,16], or other specialized software packages). For illustration, we used 77 microarrays from a previous study of patients with cervical artery dissection from Switzerland and Southern Germany (age: 42.5 ± 9.8 years; 31 (40.3%) women) [17]. DNA was isolated from peripheral blood samples (no DNA from lymphoblastoid cell lines was used). DNA extraction, array hybridization, and array scanning were performed according to the manufacturer’s instructions [17]. The Log R Ratio (LRR) and B-allele frequency (BAF) values were obtained from the CEL files with the Affymetrix Power Tools software (APT), which was also used for quantile normalization. The LRR and BAF values can then be imported into PennCNV, into other CNV detection software packages (QuantiSNP, MAD), or into noise-free-cnv.
The Affymetrix 6.0 microarrays used for CNV detection contain a total of 906,600 single nucleotide polymorphisms (SNPs) and 946,000 non-polymorphic copy number probes (CNPs) covering all human chromosomes. In the present article, the notion of SNP is used for all analyzed probe sets (SNPs as well as CNPs).

2. Noise Components

Figure 1 shows two samples (visualized by noise-free-cnv), displaying signal intensity (LRR, upper panel) and B-allele frequency (BAF, lower panel) of all SNPs ordered along the chromosomes. The Log R Ratio (LRR) is a normalized measure of the total signal intensity of the two alleles of a SNP. The B-Allele Frequency (BAF) is a normalized measure of the allelic intensity ratio of the two alleles [18]. Signal intensities in sample ID 2355 show larger variance than in ID 1022. Moreover, a prominent wave pattern is apparent in sample ID 2355, and we observed similar wave patterns in many samples. The noise-free-cnv software identifies waves using a Gaussian filter with a large standard deviation, for instance 1,000 SNPs. This filter “blurs” the values as shown in Figure 2(G,H). We called the resulting smoothed data the wave component of the LRR values. The variance of the blurred LRR values, the wave variance, is a measure of the prominence of waves.
Figure 1. Signal strength (LRR) and B-allele frequency (BAF) of samples from two male patients (ID 2355 and ID 1022). SNPs were visualized in increasing position along the chromosomes. LRR values of patient ID 2355 have larger variance and show pronounced wave noise.
Figure 2. Wave noise. Ideograms of pro-metaphase (A) and metaphase (B) chromosome 7 were compared with signal intensities of SNPs of chromosome 7 of two patients (C,D) and with a human prometaphase (E) and metaphase (F) chromosome 7. Signal intensities shown in C and D were smoothed (noise-free-cnv software, function “blur” across 1,000 probe sets) to visualize genomic waves (G,H).
This wave pattern was compared with the banding pattern of metaphase chromosomes (Figure 2). Human metaphase chromosomes were stained with the Giemsa-trypsin procedure, which induces a banding pattern. AT-rich regions are more frequent in Giemsa-dark bands than in Giemsa-light bands [19,20]. In our study samples, Giemsa-dark bands corresponded to genomic regions with reduced probe set signals. This pattern of noise was described by others as “genomic waves” or “CG-waves” [10,11,12,13]. The co-linearity of genomic waves with Giemsa bands illustrates that genomic waves follow a similar pattern in all samples.
After subtraction of the wave component, the resulting LRR values follow an approximately normal distribution around zero. We called the resulting values the per-SNP component and their variance the per-SNP variance. The decomposition of system noise into wave component and per-SNP component is shown for one sample in Figure 3. Wave variance and per-SNP variance are listed for all samples in Table A1.
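The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the noise-free-cnv implementation (which is written in C++ and uses FFTW): the function names `gaussian_blur` and `decompose_lrr`, the reflective edge handling, the kernel truncation at four standard deviations and the small demonstration filter width are all assumptions made for this example.

```python
import numpy as np


def gaussian_blur(values, sigma, truncate=4.0):
    """Smooth a 1-D sequence with a Gaussian kernel; sigma is given in SNPs."""
    values = np.asarray(values, dtype=float)
    radius = int(truncate * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    # Assumption: mirror the sequence at both ends so edge values are not
    # pulled towards zero by the filter.
    padded = np.pad(values, radius, mode="reflect")
    return np.convolve(padded, kernel, mode="valid")


def decompose_lrr(lrr, sigma=1000):
    """Split LRR values into a wave component and a per-SNP component."""
    lrr = np.asarray(lrr, dtype=float)
    centered = lrr - lrr.mean()          # normalize towards an average of zero
    wave = gaussian_blur(centered, sigma)
    per_snp = centered - wave            # what remains after removing the wave
    return wave, per_snp


# Synthetic sample: a slow wave plus independent per-SNP scatter.
rng = np.random.default_rng(0)
n_snps = 5000
true_wave = 0.1 * np.sin(2 * np.pi * np.arange(n_snps) / 1000)
lrr = true_wave + rng.normal(0.0, 0.2, n_snps)

# A narrow filter (50 SNPs) keeps this toy example fast; the article uses 1,000.
wave, per_snp = decompose_lrr(lrr, sigma=50)
wave_variance = wave.var()
per_snp_variance = per_snp.var()
print(f"wave variance: {wave_variance:.4f}, per-SNP variance: {per_snp_variance:.4f}")
```

Direct convolution is slow for a full array with a 1,000-SNP filter; an FFT-based convolution would be the natural replacement for real data.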
Figure 3. Noise components. LRR values of a noisy sample (A), split up in wave component (B) and per-SNP component (C). All SNPs of chromosomes 1–3 were shown (chromosomes indicated on top of panel A).
The system deviations of individual SNP signal intensities are strongly correlated across samples (Figure 4). To quantify the correlation of the noise components between different samples, we computed two additional data series: for each SNP, the median across all 77 per-SNP components was computed and saved as the per-SNP profile. For the wave profile, the same procedure was applied to the wave components. We then computed, for each sample, the correlation between the wave profile and the (individual) wave component as well as the correlation between the per-SNP profile and the (individual) per-SNP component. Details of the algorithm are described in the Appendix. The high correlations found in our 77 samples confirmed that wave noise and per-SNP noise are system noise, i.e., follow highly non-random patterns. On average, the correlation was 0.843 for the wave component and 0.568 for the per-SNP component.
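As a rough illustration of this step, the following Python sketch computes a median profile across samples and the correlation of each sample with that profile. The data are synthetic, the function names `median_profile` and `profile_correlations` are hypothetical, and the use of the Pearson coefficient is an assumption, since the text does not name the correlation measure.

```python
import numpy as np


def median_profile(components):
    """Per-SNP median across samples; components has shape (n_samples, n_snps)."""
    return np.median(np.asarray(components, dtype=float), axis=0)


def profile_correlations(components, profile):
    """Pearson correlation (assumed measure) of each sample with the profile."""
    return np.array([np.corrcoef(row, profile)[0, 1] for row in components])


# Synthetic per-SNP components: a probe-specific bias shared by all samples,
# scaled differently in each sample, plus independent random scatter.
rng = np.random.default_rng(1)
n_samples, n_snps = 9, 2000
shared_bias = rng.normal(0.0, 0.1, n_snps)
strength = rng.uniform(0.5, 1.5, (n_samples, 1))
components = strength * shared_bias + rng.normal(0.0, 0.1, (n_samples, n_snps))

profile = median_profile(components)
correlations = profile_correlations(components, profile)
print("mean correlation with the profile:", round(float(correlations.mean()), 3))
```

Samples in which the shared bias is strong relative to the random scatter correlate more closely with the profile, which is the behaviour the correlation is meant to capture.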

3. Factors Associated with Quality of Copy Number Data

The resolution of a classical chromosome study depends on the quality of the chromosomes and is expressed as the total number of visible cytogenetic bands (400 bands: low to moderate quality; 850 bands: excellent quality). To our knowledge, no comparable quality metric exists for molecular karyotyping. Quality control in most copy number studies consists of rejecting samples with outlier numbers of CNV findings. A quality metric for the resolution of a CNV study (relating the size of a CNV to the likelihood of its detection) has not yet been defined.
Figure 4. per-SNP system noise. Signal intensities in genomic region 2: 189766706–189891527 shown for four patients (ID 1020; ID 1022; ID1026; ID 1028). The lower panel shows the per-SNP median profile (median signal intensities) of all samples (n = 77). Arrows and arrowheads indicate SNPs with LRR values far above and below the mean.
In the current study we propose a preliminary quality metric based on the median number of SNPs per chromosome with copy number state (CN) ≠ 2 (numbers per chromosome for all cases are shown in Table A1). The copy number state of each SNP was determined by the Affymetrix Power Tools software package (APT). SNPs located in common CNVs were excluded from this analysis. To identify SNPs located in common CNVs, we analyzed 403 control samples without visible waves and with the highest genotype call rates, selected from a large German population (PopGen [21]), as described before [17]. The quality of a sample was related to the chromosomal background of SNPs with abnormal copy number (Figure 5). We chose the following quality categories: samples were classified as eligible if the median number of SNPs per chromosome with CN ≠ 2 was zero; those with >100 SNPs with CN ≠ 2 were classified as ineligible; the remaining samples were considered to be of intermediate quality.
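The classification rule can be written down as a short Python function. This is only a sketch: the function name `classify_sample` and the per-chromosome counts below are hypothetical, and it assumes that the threshold of >100 SNPs applies to the per-chromosome median, which the text does not state explicitly.

```python
from statistics import median


def classify_sample(aberrant_snps_per_chromosome):
    """Classify a sample from its per-chromosome counts of SNPs with CN != 2."""
    m = median(aberrant_snps_per_chromosome)
    if m == 0:
        return "eligible"
    if m > 100:      # assumption: the threshold refers to the median
        return "ineligible"
    return "intermediate"


# Hypothetical counts for the 22 autosomes of three samples.
clean_sample = [0] * 20 + [35, 12]   # two chromosomes carrying a rare CNV
middle_sample = [5, 12, 0, 40, 8, 3, 60, 2, 9, 15, 1,
                 7, 22, 4, 0, 11, 6, 30, 2, 8, 3, 5]
noisy_sample = [150 + 10 * i for i in range(22)]

for name, counts in [("clean", clean_sample),
                     ("middle", middle_sample),
                     ("noisy", noisy_sample)]:
    print(name, classify_sample(counts))
```

Using the median rather than the mean keeps a sample eligible when one or two chromosomes carry a genuine large CNV, as in the first hypothetical sample.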
Figure 5. Quality of copy number samples. Number of SNPs with CN ≠ 2 per chromosome were scored. Sample ID 715 is eligible for CNV studies (most chromosomes without SNPs with CN ≠ 2). Accumulation of aberrant SNPs in chromosome 7 and 18 indicates presence of rare CNVs. Sample ID 50 is of intermediate quality. Sample ID 062 was classified as ineligible for CNV studies (>100 SNPs with CN ≠ 2 in most chromosomes).
Samples were classified according to the defined quality categories in Table 1. The use of freshly prepared DNA (compared to DNA samples that had been in use for years and had been thawed and frozen repeatedly) was a significant determinant of eligible samples (p < 0.001). Samples with a high call rate (rate of successfully genotyped SNPs) were more likely to be suitable for copy number studies than those with lower call rates (p < 0.001). Low levels of wave variance as well as per-SNP variance were associated with eligibility for CNV analysis (p < 0.001). Eligibility for CNV studies was not significantly associated with the number of calls by PennCNV (p = 0.053). However, eligible samples had between 63 and 165 calls, while the range of calls was much broader in ineligible samples. Birdview yielded significantly more calls in ineligible samples (p < 0.001). The proportion of putative false positive Birdview calls increased with decreasing confidence rates: the number of CNV findings with confidence below 2.5 was most strongly elevated.
Table 1. Characteristics of 77 analyzed samples, classified according to eligibility for copy number variation (CNV) analysis. Numbers indicate mean values and range (lowest–highest value). Mean values were compared between groups with the Chi-2 test or the Kruskal-Wallis test.
Characteristic | Ineligible (n = 29) | Intermediate (n = 25) | Eligible (n = 23) | p (Chi-2/Kruskal-Wallis)
Fresh DNA preparation | 0 (0.0 %) | 6 (20.7 %) | 14 (60.9 %) | <0.001
Genotyping call rate | 94.7 [80.9–97.3] | 96.6 [94.8–98.3] | 97.7 [96.6–98.5] | <0.001
Autosomal variance | 0.2291 [0.115–0.706] | 0.1343 [0.068–0.208] | 0.0870 [0.062–0.114] | <0.001
wave noise | 0.0109 [0.002–0.058] | 0.0034 [0.001–0.017] | 0.0015 [0.001–0.013] | <0.001
per-SNP noise | 0.2259 [0.082–0.696] | 0.1281 [0.067–0.204] | 0.0811 [0.060–0.164] | <0.001
PennCNV, No. of calls | 238 [14–1821] | 103 [34–1024] | 98 [63–165] | 0.053
PennCNV, % of deletions | 18.6 [1.3–81.3] | 27.4 [0.7–65.9] | 40.0 [10.3–54.8] | 0.164
Birdview No. of calls | 527 [163–8,203] | 225 [154–1,339] | 208 [163–348] | <0.001
Birdview (cf > 10) | 15 [2–717] | 12 [5–33] | 14 [4–20] | 0.048
Birdview (cf = 10) | 89 [76–145] | 92 [74–105] | 94 [77–102] | 0.209
Birdview (cf 2.5–10) | 93 [14–3344] | 19 [10–361] | 21 [11–45] | <0.001
Birdview (cf < 2.5) | 370 [52–5665] | 106 [35–857] | 85 [42–194] | <0.001
Figure 6 summarizes salient aspects of system noise in SNP microarrays. Figure 6(A) plots for each sample the variances of wave component and per-SNP component. Wave variance and per-SNP variance seem to occur independently from each other: the observed correlation between both noise components (r = 0.124) was not significant (p = 0.401). Figure 6(B) illustrates the relation between sample eligibility and noise components in the eligible (n = 23) and ineligible (n = 29) cases. Eligible samples (i.e., those that are supposed to be excellent for copy number studies) have low levels of per-SNP variance. Samples with high wave variance are inappropriate for copy number studies.
Figure 6. Wave variance and per-SNP variance. (A) Noise components in all 77 samples and (B) in samples of low (O) and high (●) quality (samples of intermediate quality were not included in (B)).

4. Noise Reduction in Copy Number Samples

The noise-free-cnv software package permits the visualization of samples, the isolation of noise components and the subtraction of isolated noise components. The next two examples (Figure 7 and Figure 8) illustrate noise reduction by comparing a test sample with a reference sample. We finally demonstrate the use of the noise-free-cnv-filter algorithm for the evaluation of CNVs.
Figure 7 shows a deletion in chromosome 20 of patient ID 1091, which was detected by PennCNV and Birdview analysis. Due to strong waves, reduced signal intensities in the region of the putative deletion are not easily seen. Visual inspection of the LRR values of chromosome 20 after subtraction of a reference sample (A–B) suggested the presence of a true deletion in this patient.
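Why subtracting a reference sample helps can be shown with a small synthetic example in Python. The function name `subtract_reference`, the simulated wave and the position of the simulated deletion are assumptions for illustration only; the sketch relies on both samples sharing the same wave pattern, as strongly wave-affected samples in the article do.

```python
import numpy as np


def subtract_reference(test_lrr, reference_lrr):
    """Return test minus reference LRR values, SNP by SNP."""
    test = np.asarray(test_lrr, dtype=float)
    reference = np.asarray(reference_lrr, dtype=float)
    if test.shape != reference.shape:
        raise ValueError("samples must cover the same ordered set of SNPs")
    return test - reference


# Both samples share one wave; only the test sample carries a deletion.
rng = np.random.default_rng(2)
n_snps = 3000
shared_wave = 0.3 * np.sin(2 * np.pi * np.arange(n_snps) / 500)
reference = shared_wave + rng.normal(0.0, 0.1, n_snps)
test = shared_wave + rng.normal(0.0, 0.1, n_snps)
deleted = slice(1400, 1500)          # hypothetical deleted region
test[deleted] -= 0.5

difference = subtract_reference(test, reference)
outside = np.r_[difference[:1400], difference[1500:]]
print("mean LRR inside region:", round(float(difference[deleted].mean()), 3))
print("mean LRR outside region:", round(float(outside.mean()), 3))
```

The shared wave cancels, so the shift in the deleted region stands out against a flat background. The random scatter of the reference is added to the result, so a reference with low per-SNP variance is preferable.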
Figure 7. Signal intensities (y-axis: LRR values) of all SNPs from chromosome 18q up to chromosome 22. (A) Patient ID 1091; (B) reference sample ID 2355. After subtraction of the samples, a deletion in chromosome 20 became apparent (arrow).
Figure 8 illustrates the analysis of a mosaic deletion. Although sample ID 062 was classified as ineligible for CNV studies, analysis of SNPs with CN ≠ 2 per chromosome revealed significant clustering on chromosome 5 (Table A1; Figure 5). Neither PennCNV nor Birdsuite identified a large CNV on chromosome 5. After noise reduction, LRR and BAF values were suggestive of a mosaic deletion [22,23,24] (Figure 8(B,D)). To confirm the diagnosis of a mosaic deletion, a conventional chromosome analysis was performed: a few 5q-minus chromosomes were observed amongst a majority of normal chromosome sets. Interestingly, it was recently demonstrated that the identification of mosaic abnormalities by microarray analysis is unreliable [25].
We developed the noise-free-cnv-filter algorithm for optimized noise reduction (Appendix). In the samples of our study population, noise-free-cnv-filter analysis resulted in an average reduction of the wave variance by 74.2%, of per-SNP variance by 35.3% and of the overall variance by 38.1%. Noise-reduction according to this algorithm supports the evaluation of CNV findings, in particular when the putative CNVs are small (Figure 9).
In patient ID 715, both Birdview and PennCNV identified a deletion on chromosome 18 (green bar in Figure 9). Noise-free-cnv-filter analysis of the sample (ID 715 nf) suggested that the deletion was true. Subsequent molecular analysis confirmed the finding: the joining segment of the deletion was identified by a case-specific PCR and the breakpoints of the deletion were identified by DNA sequencing following standard procedures [17,26]. Two putative duplications in patient ID 412 were evaluated after noise-free-cnv-filter analysis. We considered the duplication in chromosome 1 (region 222 Mb) as spurious (red bar), but the duplication in chromosome 9 as probably true. As a consequence, the putative duplication in chromosome 9 is a candidate for further validation by molecular methods.
Figure 8. Sample with mosaic large deletion in chromosome 5q. (A,B) LRR- and BAF-values of SNPs of chromosomes 5 and 6 of patient. (C) LRR values of reference sample. (D) Signal intensities after subtraction of reference sample. Arrows indicate region with reduced LRR values. (E) LRR values after application of noise-free-cnv blur over 2,000 SNPs. (Bottom panel) Chromosome analysis of cultured peripheral blood lymphocytes from patient (courtesy of Johannes W.G. Janssen, Department of Human Genetics, University of Heidelberg). Arrow points to 5q-minus chromosome.
Figure 9. Validation of CNV findings. Left panels show crude LRR values, right panels show LRR values after noise-free-cnv-filter analysis. Samples were renamed with suffix “nf” after noise-free-cnv-filter analysis. Bars indicate putative CNV findings.

5. Conclusions—Proposal of a Two-Step Procedure for the Validation of CNV Findings

Our analysis had the following key findings: (1) Copy number samples may be noisy, which interferes—above a certain level of noise—with reliable identification of CNVs; (2) Eligible copy number samples were more likely when fresh DNA was used for microarray hybridization; (3) wave component and per-SNP component of noise are independent; (4) noise-free-cnv software enables noise reduction by subtracting wave and per-SNP noise components from samples; and (5) noise-free-cnv software supports the quality control of copy number data and the validation of copy number findings.
The current noise-free-cnv version was developed for the analysis of SNP microarray samples and was not designed for noise reduction in array based comparative genomic hybridization samples. The present study highlighted the value of noise reduction for large scale CNV validation (after software-assisted CNV detection). However, the value of noise reduction before software-assisted CNV detection is to be analyzed in future studies.
Based on our analysis of noise in real-life copy number samples, we suggest a two-step procedure of CNV validation. As a first step of preliminary CNV validation, we propose large-scale inspection of CNV findings after noise reduction, to select putative candidate CNVs and reject false positive findings. In a second step, this selection of putative CNV calls is analyzed further by independent molecular methods for final validation [17,26].

Acknowledgments

This work was supported by a grant from the Swiss Heart Foundation.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Girirajan, S.; Campbell, C.D.; Eichler, E.E. Human copy number variation and complex genetic disease. Annu. Rev. Genet. 2011, 45, 203–226.
  2. Zhang, F.; Gu, W.; Hurles, M.E.; Lupski, J.R. Copy number variation in human health, disease, and evolution. Annu. Rev. Genomics Hum. Genet. 2009, 10, 451–481.
  3. Fakhro, K.A.; Choi, M.; Ware, S.M.; Belmont, J.W.; Towbin, J.A.; Lifton, R.P.; Khokha, M.K.; Brueckner, M. Rare copy number variations in congenital heart disease patients identify unique genes in left-right patterning. Proc. Natl. Acad. Sci. USA 2011, 108, 2915–2920.
  4. Priebe, L.; Degenhardt, F.; Strohmaier, J.; Breuer, R.; Herms, S.; Witt, S.H.; Hoffmann, P.; Kulbida, R.; Mattheisen, M.; Moebus, S.; et al. Copy number variants in German patients with schizophrenia. PLoS One 2013, 8, e64035.
  5. Vandeweyer, G.; Kooy, R.F. Detection and interpretation of genomic structural variation in health and disease. Expert. Rev. Mol. Diagn. 2013, 13, 61–82.
  6. Southard, A.E.; Edelmann, L.J.; Gelb, B.D. Role of copy number variants in structural birth defects. Pediatrics 2012, 129, 755–763.
  7. Zhang, D.; Qian, Y.; Akula, N.; Alliey-Rodriguez, N.; Tang, J.; The Bipolar Genome Study; Gershon, E.S.; Liu, C. Accuracy of CNV detection from GWAS data. PLoS One 2011, 6, e14511.
  8. Dellinger, A.E.; Saw, S.M.; Goh, L.K.; Seielstad, M.; Young, T.L.; Li, Y.J. Comparative analyses of seven algorithms for copy number variant identification from single nucleotide polymorphism arrays. Nucleic Acids Res. 2010, 38, e105.
  9. Zheng, X.; Shaffer, J.R.; McHugh, C.P.; Laurie, C.C.; Feenstra, B.; Melbye, M.; Murray, J.C.; Marazita, M.L.; Feingold, E. Using family data as a verification standard to evaluate copy number variation calling strategies for genetic association studies. Genet. Epidemiol. 2012, 36, 253–262.
  10. Marioni, J.C.; Thorne, N.P.; Valsesia, A.; Fitzgerald, T.; Redon, R.; Fiegler, H.; Andrews, T.D.; Stranger, B.E.; Lynch, A.G.; Dermitzakis, E.T.; et al. Breaking the waves: Improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biol. 2007, 8, R228.
  11. Diskin, S.J.; Li, M.; Hou, C.; Yang, S.; Glessner, J.; Hakonarson, H.; Bucan, M.; Maris, J.M.; Wang, K. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 2008, 36, e126.
  12. Van de Wiel, M.A.; Brosens, R.; Eilers, P.H.; Kumps, C.; Meijer, G.A.; Menten, B.; Sistermans, E.; Speleman, F.; Timmerman, M.E.; Ylstra, B. Smoothing waves in array CGH tumor profiles. Bioinformatics 2009, 25, 1099–1104.
  13. Lee, Y.H.; Ronemus, M.; Kendall, J.; Lakshmi, B.; Leotta, A.; Levy, D.; Esposito, D.; Grubor, V.; Ye, K.; Wigler, M.; et al. Reducing system noise in copy number data using principal components of self-self hybridizations. Proc. Natl. Acad. Sci. USA 2012, 109, E103–E110.
  14. Wang, K.; Li, M.; Hadley, D.; Liu, R.; Glessner, J.; Grant, S.F.; Hakonarson, H.; Bucan, M. PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007, 17, 1665–1674.
  15. Korn, J.M.; Kuruvilla, F.G.; McCarroll, S.A.; Wysoker, A.; Nemesh, J.; Cawley, S.; Hubbell, E.; Veitch, J.; Collins, P.J.; Darvishi, K.; et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat. Genet. 2008, 40, 1253–1260.
  16. McCarroll, S.A.; Kuruvilla, F.G.; Korn, J.M.; Cawley, S.; Nemesh, J.; Wysoker, A.; Shapero, M.H.; de Bakker, P.I.; Maller, J.B.; Kirby, A.; et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat. Genet. 2008, 40, 1166–1174.
  17. Grond-Ginsbach, C.; Chen, B.; Pjontek, R.; Wiest, T.; Burwinkel, B.; Tchatchou, S.; Krawczak, M.; Schreiber, S.; Brandt, T.; Kloss, M.; et al. Copy number variation in patients with cervical artery dissection. Eur. J. Hum. Genet. 2012, 20, 1295–1299.
  18. Wang, K.; Bucan, M. Copy number variation detection via high-density SNP genotyping. Cold Spring Harb. Protoc. 2008, 2008.
  19. Niimura, Y.; Gojobori, T. In silico chromosome staining: Reconstruction of Giemsa bands from the whole human genome sequence. Proc. Natl. Acad. Sci. USA 2002, 99, 797–802.
  20. Costantini, M.L.; Clay, O.; Federico, C.; Saccone, S.; Auletta, F.; Bernardi, G. Human chromosomal bands: Nested structure, high-definition map and molecular basis. Chromosoma 2007, 116, 29–40.
  21. Krawczak, M.; Nikolaus, S.; von Eberstein, H.; Croucher, P.J.; El Mokhtari, N.E.; Schreiber, S. PopGen: Population-based recruitment of patients and controls for the analysis of complex genotype-phenotype relationships. Community Genet. 2006, 9, 55–61.
  22. Piotrowski, A.; Bruder, C.E.; Andersson, R.; Diaz de Ståhl, T.; Menzel, U.; Sandgren, J.; Poplawski, A.; von Tell, D.; Crasto, C.; Bogdan, A.; et al. Somatic mosaicism for copy number variation in differentiated human tissues. Hum. Mutat. 2008, 29, 1118–1124.
  23. Jasmine, F.; Rahaman, R.; Dodsworth, C.; Roy, S.; Paul, R.; Raza, M.; Paul-Brutus, R.; Kamal, M.; Ahsan, H.; Kibriya, M.G. A genome-wide study of cytogenetic changes in colorectal cancer using SNP microarrays: Opportunities for future personalized treatment. PLoS One 2012, 7, e31968.
  24. Laurie, C.C.; Laurie, C.A.; Rice, K.; Doheny, K.F.; Zelnick, L.R.; McHugh, C.P.; Ling, H.; Hetrick, K.N.; Pugh, E.W.; Amos, C.; et al. Detectable clonal mosaicism from birth to old age and its relationship to cancer. Nat. Genet. 2012, 44, 642–650.
  25. Bi, W.; Borgan, C.; Pursley, A.N.; Hixson, P.; Shaw, C.A.; Bacino, C.A.; Lalani, S.R.; Patel, A.; Stankiewicz, P.; Lupski, J.R.; et al. Comparison of chromosome analysis and chromosomal microarray analysis: What is the value of chromosome analysis in today’s genomic array era? Genet. Med. 2013, 15, 450–457.
  26. Vissers, L.E.; Bhatt, S.S.; Janssen, I.M.; Xia, Z.; Lalani, S.R.; Pfundt, R.; Derwinska, K.; de Vries, B.B.; Gilissen, C.; Hoischen, A.; et al. Rare pathogenic microdeletions and tandem duplications are microhomology-mediated and stimulated by local genomic architecture. Hum. Mol. Genet. 2009, 18, 3579–3593.
  27. Frigo, M.; Johnson, S.G. The design and implementation of FFTW3. Proc. IEEE 2005, 93, 216–231.

Appendix: Comments to the Noise-Free-CNV Software

A1. Noise-Free-CNV

The noise-free-cnv program package was specifically developed to analyze copy number variation in SNP-microarray samples and to manipulate the data in order to reduce noise. It was written in C++ and released as free software under the GNU General Public License version 3. Installer packages are available for Debian-based Linux systems and Windows. For the computation of the Fast Fourier Transform, we used the FFTW library [27]. Noise-free-cnv is compatible with the file format used by PennCNV [14].
The central program of the noise-free-cnv package is noise-free-cnv-gtk, a visual editor for interactive visualization and manipulation of SNP microarray data. Besides functioning as a browser for direct inspection and verification of CNV findings, it allows the user to perform many operations on the data. These include the Gaussian filters and variance computation referred to in the article. For further information, see the project homepage http://noise-free-cnv.sourceforge.net. A second program, noise-free-cnv-filter, implements a specific algorithm for system noise reduction, as described below. It is usable as a command line program to be easily applied to a batch of samples.

A2. The Noise-Free-CNV-Filter Algorithm

The noise reduction algorithm noise-free-cnv-filter consists of two main steps. In the first step, a genomic wave profile and a per-SNP noise profile are deduced from a batch of samples. In the second step, these profiles are used to modify the individual samples.

A2.1. System Noise Assessment

For each individual sample:
(1) The non-autosomal data are removed and the Log R Ratio values are normalized towards an average value of zero.
(2) The wave component is computed by applying a Gaussian filter with a standard deviation of 1,000 SNPs to the Log R Ratio sequence.
(3) The wave component is subtracted from the Log R Ratio values to calculate the per-SNP component.
Subsequently, the batch-specific wave profile is computed by taking, for each SNP, the median value across the wave components of all samples. The per-SNP profile is computed in the same way from the per-SNP components.

A2.2. System Noise Removal

In the second step, we use the median profiles to adjust the original samples.
For each individual:
(1) The covariance of the wave component and the batch-specific wave profile is divided by the variance of the wave profile.
(2) The result is used as a scaling factor for the wave profile; the scaled profile is then subtracted from the wave component:
$$ w_i^{\mathrm{corrected}} = w_i - \frac{\operatorname{cov}(w_i, W)}{\operatorname{var}(W)}\, W $$
where $w_i$ is the wave component of sample $i$ and $W$ is the batch-specific wave profile. The same procedure is repeated on the per-SNP components.
(3) Finally, the corrected components are added together to yield the corrected Log R Ratio values.
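The removal step amounts to a least-squares projection of each component onto its batch profile. The following Python sketch, with hypothetical function names, shows one way to write it. It assumes that the components and profiles are equal-length sequences in the same SNP order and that the profile is not constant (otherwise its variance is zero and the scaling factor is undefined).

```python
def mean(values):
    return sum(values) / len(values)


def scaling_factor(component, profile):
    """Covariance of component and profile divided by the variance of the profile."""
    mc, mp = mean(component), mean(profile)
    # The 1/n factors of covariance and variance cancel in the ratio.
    cov = sum((c - mc) * (p - mp) for c, p in zip(component, profile))
    var = sum((p - mp) ** 2 for p in profile)
    return cov / var


def remove_profile(component, profile):
    """Subtract the scaled batch profile from one sample component."""
    factor = scaling_factor(component, profile)
    return [c - factor * p for c, p in zip(component, profile)]


def correct_sample(wave, per_snp, wave_profile, per_snp_profile):
    """Steps (1)-(3): clean both components and add them back together."""
    clean_wave = remove_profile(wave, wave_profile)
    clean_per_snp = remove_profile(per_snp, per_snp_profile)
    return [w + s for w, s in zip(clean_wave, clean_per_snp)]
```

Because the scaling factor is a regression slope, the corrected component is uncorrelated with the profile, and a sample whose noise is stronger than the batch median receives a factor above one.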

A3. Program Usage

Noise-free-cnv-filter was implemented as a command-line program. In the simplest case, it receives the file names of several SNP microarray samples in the PennCNV file format (due to the nature of the algorithm, application to a single sample is pointless). It then computes the profiles (saved as “wave_profile” and “per-snp_profile”) and the cleaned versions of all provided samples, which it saves as “<original filename>.nf”. As additional options, noise-free-cnv-filter allows the use of pre-computed profile sequences and the inclusion of the sex chromosomes in the analysis. As an example, noise-free-cnv-filter --verbose individuals/* applies the algorithm to all files in the directory individuals, discards the sex chromosomes, and outputs detailed information about the progress together with statistical information about the samples. For further help, type: noise-free-cnv-filter --help.
Table A1. Eligibility of samples.
ID | call rate | var. | wave var. | per-SNP var. | Chr1 | Chr2 | Chr3 | Chr4 | Chr5 | Chr6 | Chr7 | Chr8 | Chr9 | Chr10 | Chr11 | Chr12 | Chr13 | Chr14 | Chr15 | Chr16 | Chr17 | Chr18 | Chr19 | Chr20 | Chr21 | Chr22
398.330.0680.0010.0672002200000011010022000102
1596.020.1440.0010.142104388112050161181501307822112119040
3696.000.2430.0040.2385068456787511,376503977974722385379639593232541752356397364225262203
3895.760.1830.0010.181179801024814080912683384122442714420471144115504020
4896.160.1740.0070.16703960141234290695651059110183236129320941555
4994.810.1740.0070.16720331351114028332214622981541002312070
5097.140.1030.0020.1014611014222422080065001123045022
6297.920.0900.0030.0870140143191511220000006160002
7193.920.2020.0020.20066749032678149402692522151164523196163142266134148248389750
7693.710.2290.0020.22651125120065229593364673524224572338518517025228516146211216749
9789.520.2910.0080.2823,1235,46711,6133,7956,7294,8705,3744,8984,4554,1695,7216,4923,4863,0203,7094,0314,1203,1773,5812,4661,7751,281
10197.850.0770.0020.0750000000000000900000200
11196.510.1750.0030.17270761449287187297664266105476117152007410136621
11294.700.1470.0110.1342,3782,9882,7432,5142,8234,4552,7743,0142,8372,7074,2273,0542,3151,5951,9051,0872,0852,0441,562650580602
12996.610.1390.0030.1357315191935394307723231009602740220
13196.450.1990.0020.1963192291393141081252483041723632796896931182121827183281328
14196.440.1210.0170.10126612117417314764195121258244329145365920210596208908672
14494.720.1760.0050.17025174613046425032825341039013979955553631931785228619591545
16894.360.3150.0360.2755,2055,3825,2724,4794,3375,7376,6824,5043,9662,6272,9015,2942,4923,0652,4873,1262,5592,6582,1791,405961948
18295.020.3160.0020.3131,1891,9882,0511,0962,3012,9911,8142,3651,4086871,1441,3148396546898155691,953427517456377
18890.120.4740.0110.46114,53415,55428,32210,24510,90416,21212,47114,3006,6426,4998,6819,2415,5956,7847,5035,1506,0487,0283,9782,4173,5362,274
18997.320.0970.0120.08412034155123541746902947341661000300352
19397.340.0930.0040.0885330300057162047000000000
41297.510.0920.0060.086006220020021401022000220004
41598.230.1030.0010.1025110531035020020000090002
42196.730.1650.0550.10207000528000000200040000
42296.100.0740.0020.07337,99636,65432,34329,93529,24829,06223,19824,95121,45722,17222,24721,57415,56014,54214,52512,57410,44414,0958,61610,2947,2184,456
43096.390.1600.0190.13810,61410,0969,19616,27815,3837,9597,7589,4656,7576,2956,2916,5594,9553,4872,4913,4593,0745,4041,5252,3993,5691,187
43889.760.4630.0570.39945,98156,16242,40345,75055,97637,90144,21234,06930,38726,84226,46138,93023,96320,80018,25819,16016,62418,3679,63713,4858,7905,479
44297.820.0840.0080.076486382001409572033400000067043
45195.960.2050.0040.20010915449363753611072131241008515012420634601165059019
46180.860.7060.0070.69664,82274,30241,65564,99356,98758,82549,36955,43345,35747,37235,75349,37730,87027,15825,44829,59014,63722,23115,37219,28912,4318,971
61397.640.0900.0010.08826374049150022210200007020
64795.490.1570.0020.15415130166047906454158052841240144411
65398.220.0790.0040.074122141,61826712709000002000044
66597.320.1230.0370.0825,0718,1406,2895,5206,4176,8165,8906,8944,2385,3264,5035,2462,6412,6962,3011,7661,4454,2111,6991,3992,448932
67096.740.1520.0430.1053,1604,4863,8953,9133,8123,0383,3093,6762,3592,1123,3272,7911,5911,5951,1429691,0282,0435431,2131,034286
67597.670.0840.0090.07400304133700584000110048000
67698.150.0780.0010.0770000452180001440000000000
67795.580.2080.0030.204251491615477572671573821063135264344457137121165675
69395.720.1330.0050.1284731863112152770108731410053202012
71597.440.0950.0060.0893500404710200004040240400
71796.600.1340.0010.1320222212941017065710020150034000
72994.750.1890.0010.1871152311518449558491424510111116571038057611920220
73395.840.1980.0090.18823288421821803061804034591132333621086838460
73595.280.1730.0030.1699421111290814111614220790874731344316131810022
74296.830.1140.0130.09900302010402001582180137000
74497.260.1080.0170.0890135126332731482630135315215018034154001661140940
74695.720.2530.0040.2482838953503782873882277235947122295593232069847839868934173184173
75097.150.1140.0150.0972,3896421,7501,5011,3701,7927071,440997615674750559225544631878144581771840
75297.800.1030.0040.0998812111915519683118187932066614727355041327015362725
79694.230.2470.0020.2442,1654825681,1275392633605914406979088061066302564982625651272756748
102098.210.0750.0010.07420002120000300000080160
102297.770.0820.0010.08161623022201201100000010000
102698.340.0660.0010.06410005000002001190003000
102897.490.0890.0010.088900040490300300000000000
102998.540.0620.0010.0603640000000010250025020000
103397.580.0870.0010.085008300020704400080000000
103497.500.0920.0100.0810002001200000000000000
103798.260.0680.0010.06785250262120210200354132200
104096.560.1140.0010.1139017533140000017000000010
104197.160.0940.0010.09300042200001330000000000
104297.460.0870.0030.08415141920112810194226841842242020
105697.110.0980.0030.09520220200000150400350100018
106398.310.0750.0020.072021691000290000020130002000
106598.230.0680.0030.06504270002000080000584000
108897.640.0920.0040.088532361458902676324450713461554022134939
109196.690.1380.0290.1057,6316,0807,1466,5127,2994,0064,9526,6664,1692,5793,9904,0122,3882,1941,4152,6131,0302,8286463,6971,862451
114797.960.0790.0020.07757648004416023202207190000
115197.900.0870.0010.08624003008049000002000112
211093.160.3430.0100.3326713,3073,6141,4143,5483,0352,7752,0701,5812,7182,1532,0541,5731,4269951,1731,5841,358662285971581
213495.730.1340.0080.1251445281515110028887003577355716567412181
214497.120.0930.0040.08925814040201440003000200
224094.480.2990.0040.2944881,6561,0191,7342,1801,6057541,7501,1795319161,728916751643921501427713356339374
235594.500.2580.0260.2291,8701,7991,4227411,4912,8291,6631,3538291,5001,3601,7146546771,2627671,4201,061842826464452
240694.780.1950.0110.183671251361222018014210962615517068347608312253430
D_06294.170.3220.0030.3187728385475364,419436711496239308219582319215222816254010186094
For each sample, genotype call rate, variance, wave variance and per-SNP variance were calculated. The remaining columns show, for each chromosome (chromosome number indicated), the number of probe sets with CN ≠ 2. This analysis included only probe sets that had normal copy number (CN = 2) in 403 samples from a population-based German study (for details and references see [17]). A non-random distribution of probe sets with CN ≠ 2 is highly suggestive of a rare CNV (for instance ID 1147 or ID 653, in contrast to ID 1042 or ID 2034). Even in samples with high variance, a non-random distribution can be detected (chromosome 5 of ID D_062, chromosomes 1 and 2 of ID 442).
Table A2. Analysis of noise components in samples.
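ID | variance | wave variance | per-SNP variance | wave correlation | per-SNP correlation | wave subtraction factor | per-SNP subtraction factor
3 | 0.068 | 0.001 | 0.067 | 0.804 | 0.676 | 0.409 | 0.800
15 | 0.144 | 0.001 | 0.142 | 0.567 | 0.564 | 0.393 | 0.972
36 | 0.243 | 0.004 | 0.238 | 0.877 | 0.388 | 0.997 | 0.865
38 | 0.183 | 0.001 | 0.181 | 0.456 | 0.502 | 0.314 | 0.977
48 | 0.174 | 0.007 | 0.167 | 0.939 | 0.508 | 1.405 | 0.949
49 | 0.174 | 0.007 | 0.167 | 0.959 | 0.527 | 1.419 | 0.985
50 | 0.103 | 0.002 | 0.101 | 0.881 | 0.536 | 0.626 | 0.779
62 | 0.090 | 0.003 | 0.087 | 0.929 | 0.609 | 0.886 | 0.820
71 | 0.202 | 0.002 | 0.200 | 0.365 | 0.589 | 0.274 | 1.203
76 | 0.229 | 0.002 | 0.226 | 0.580 | 0.529 | 0.511 | 1.151
97 | 0.291 | 0.008 | 0.282 | 0.875 | 0.360 | 1.427 | 0.873
101 | 0.077 | 0.002 | 0.075 | 0.906 | 0.727 | 0.717 | 0.910
111 | 0.175 | 0.003 | 0.172 | 0.914 | 0.482 | 0.903 | 0.914
112 | 0.147 | 0.011 | 0.134 | 0.864 | 0.578 | 1.671 | 0.968
129 | 0.139 | 0.003 | 0.135 | 0.898 | 0.620 | 0.964 | 1.043
131 | 0.199 | 0.002 | 0.196 | 0.875 | 0.417 | 0.790 | 0.844
141 | 0.121 | 0.017 | 0.101 | 0.949 | 0.703 | 2.297 | 1.024
144 | 0.176 | 0.005 | 0.170 | 0.881 | 0.591 | 1.141 | 1.115
168 | 0.315 | 0.036 | 0.275 | 0.936 | 0.406 | 3.234 | 0.975
182 | 0.316 | 0.002 | 0.313 | 0.809 | 0.386 | 0.704 | 0.990
189 | 0.097 | 0.012 | 0.084 | 0.930 | 0.668 | 1.874 | 0.883
193 | 0.093 | 0.004 | 0.088 | 0.960 | 0.704 | 1.174 | 0.956
412 | 0.092 | 0.006 | 0.086 | 0.950 | 0.691 | 1.304 | 0.925
421 | 0.103 | 0.001 | 0.102 | 0.898 | 0.680 | 0.598 | 0.991
422 | 0.165 | 0.055 | 0.102 | 0.873 | 0.523 | 3.763 | 0.763
425 | 0.074 | 0.002 | 0.073 | 0.908 | 0.572 | 0.653 | 0.705
430 | 0.160 | 0.019 | 0.138 | 0.896 | 0.457 | 2.265 | 0.778
438 | 0.463 | 0.057 | 0.399 | 0.913 | 0.354 | 4.006 | 1.021
442 | 0.084 | 0.008 | 0.076 | 0.942 | 0.663 | 1.496 | 0.835
451 | 0.205 | 0.004 | 0.200 | 0.927 | 0.453 | 1.111 | 0.927
461 | 0.706 | 0.007 | 0.696 | −0.310 | 0.228 | −0.460 | 0.868
613 | 0.090 | 0.001 | 0.088 | 0.850 | 0.565 | 0.589 | 0.767
647 | 0.157 | 0.002 | 0.154 | 0.766 | 0.589 | 0.671 | 1.059
653 | 0.079 | 0.004 | 0.074 | 0.938 | 0.569 | 1.141 | 0.707
665 | 0.123 | 0.037 | 0.082 | 0.896 | 0.555 | 3.159 | 0.726
670 | 0.152 | 0.043 | 0.105 | 0.906 | 0.565 | 3.422 | 0.837
675 | 0.084 | 0.009 | 0.074 | 0.951 | 0.672 | 1.647 | 0.837
676 | 0.078 | 0.001 | 0.077 | 0.646 | 0.635 | 0.313 | 0.807
677 | 0.208 | 0.003 | 0.204 | 0.901 | 0.465 | 0.870 | 0.960
693 | 0.133 | 0.005 | 0.128 | 0.953 | 0.581 | 1.179 | 0.952
715 | 0.095 | 0.006 | 0.089 | 0.950 | 0.667 | 1.308 | 0.911
717 | 0.134 | 0.001 | 0.132 | 0.522 | 0.650 | 0.351 | 1.081
729 | 0.189 | 0.001 | 0.187 | 0.606 | 0.611 | 0.411 | 1.209
733 | 0.198 | 0.009 | 0.188 | 0.947 | 0.488 | 1.628 | 0.967
735 | 0.173 | 0.003 | 0.169 | 0.901 | 0.609 | 0.917 | 1.144
742 | 0.114 | 0.013 | 0.099 | 0.956 | 0.649 | 2.017 | 0.936
744 | 0.108 | 0.017 | 0.089 | 0.921 | 0.570 | 2.186 | 0.779
746 | 0.253 | 0.004 | 0.248 | 0.906 | 0.414 | 1.033 | 0.943
750 | 0.114 | 0.015 | 0.097 | 0.940 | 0.612 | 2.137 | 0.874
752 | 0.103 | 0.004 | 0.099 | 0.937 | 0.471 | 1.033 | 0.679
796 | 0.247 | 0.002 | 0.244 | 0.015 | 0.527 | 0.012 | 1.192
1020 | 0.075 | 0.001 | 0.074 | 0.742 | 0.614 | 0.348 | 0.767
1022 | 0.082 | 0.001 | 0.081 | 0.909 | 0.644 | 0.640 | 0.836
1026 | 0.066 | 0.001 | 0.064 | 0.861 | 0.664 | 0.557 | 0.770
1028 | 0.089 | 0.001 | 0.088 | 0.742 | 0.643 | 0.380 | 0.871
1029 | 0.062 | 0.001 | 0.060 | 0.912 | 0.661 | 0.572 | 0.742
1033 | 0.087 | 0.001 | 0.085 | 0.820 | 0.703 | 0.518 | 0.940
1034 | 0.092 | 0.010 | 0.081 | 0.963 | 0.709 | 1.732 | 0.924
1037 | 0.068 | 0.001 | 0.067 | 0.782 | 0.633 | 0.390 | 0.748
1040 | 0.114 | 0.001 | 0.113 | 0.639 | 0.701 | 0.355 | 1.077
1041 | 0.094 | 0.001 | 0.093 | 0.850 | 0.672 | 0.546 | 0.937
1042 | 0.087 | 0.003 | 0.084 | 0.947 | 0.695 | 0.884 | 0.921
1056 | 0.098 | 0.003 | 0.095 | 0.941 | 0.686 | 0.893 | 0.966
1063 | 0.075 | 0.002 | 0.072 | 0.924 | 0.571 | 0.832 | 0.701
1065 | 0.068 | 0.003 | 0.065 | 0.959 | 0.657 | 0.904 | 0.764
1088 | 0.092 | 0.004 | 0.088 | 0.918 | 0.497 | 1.046 | 0.675
1091 | 0.138 | 0.029 | 0.105 | 0.912 | 0.537 | 2.854 | 0.797
1147 | 0.079 | 0.002 | 0.077 | 0.944 | 0.717 | 0.795 | 0.909
1151 | 0.087 | 0.001 | 0.086 | 0.688 | 0.572 | 0.311 | 0.769
2110 | 0.343 | 0.010 | 0.332 | 0.948 | 0.408 | 1.715 | 1.075
2134 | 0.134 | 0.008 | 0.125 | 0.936 | 0.551 | 1.575 | 0.893
2144 | 0.093 | 0.004 | 0.089 | 0.953 | 0.609 | 1.046 | 0.832
2240 | 0.299 | 0.004 | 0.294 | 0.909 | 0.433 | 1.060 | 1.075
2355 | 0.258 | 0.026 | 0.229 | 0.960 | 0.530 | 2.826 | 1.161
2406 | 0.195 | 0.011 | 0.183 | 0.940 | 0.570 | 1.833 | 1.113
188c | 0.474 | 0.011 | 0.461 | 0.899 | 0.314 | 1.715 | 0.975
D62 | 0.322 | 0.003 | 0.318 | 0.819 | 0.340 | 0.761 | 0.878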
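The correlation and subtraction-factor columns of Table A2 can be related by a standard identity. The table does not define these columns, so the following assumes that "correlation" is the Pearson correlation between a sample component and the corresponding batch profile, and that "subtraction factor" is the scaling factor of Section A2.2. Writing $c$ for a sample's wave or per-SNP component and $P$ for the matching batch profile (symbols introduced here for illustration):

$$ f = \frac{\operatorname{cov}(c, P)}{\operatorname{var}(P)} = r \cdot \frac{\sigma_c}{\sigma_P}, \qquad r = \frac{\operatorname{cov}(c, P)}{\sigma_c\, \sigma_P} $$

Under this reading, the factor grows with the strength of the noise in the individual sample, not only with its similarity to the profile. This would explain why ID 438 (wave variance 0.057, wave correlation 0.913) has a wave subtraction factor of 4.006, while ID 3 (wave variance 0.001, wave correlation 0.804) has a factor of 0.409, and why the negative wave correlation of ID 461 (−0.310) comes with a negative factor (−0.460).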

Share and Cite

MDPI and ACS Style

Ginsbach, P.; Chen, B.; Jiang, Y.; Engelter, S.T.; Grond-Ginsbach, C. Copy Number Studies in Noisy Samples. Microarrays 2013, 2, 284-303. https://doi.org/10.3390/microarrays2040284