Custom Array Comparative Genomic Hybridization: the Importance of DNA Quality, an Expert Eye, and Variant Validation

The presence of false positive and false negative results in Array Comparative Genomic Hybridization (aCGH) designs is poorly addressed in the literature. We took advantage of a custom aCGH recently carried out to analyze its design performance, the use of several Agilent aberration detection algorithms, and the presence of false results. Our study confirms that a high density design does not generate more noise than standard designs and can reach a good resolution. We noticed a non-negligible presence of false negative and false positive results in the imbalance calls performed by the Agilent software. The Aberration Detection Method 2 (ADM-2) algorithm with a threshold of 6 performed quite well, and the array design proved to be reliable, provided that some additional filters are applied, such as considering only intervals with an average absolute log2 ratio above 0.3. We also propose an additional filter that takes into account the proportion of probes with log2 ratios exceeding values suggestive of gain or loss. In addition, the quality of the samples was confirmed to be a crucial parameter. Finally, this work raises the importance of evaluating sample profiles by eye and the necessity of validating the imbalances detected.


Introduction
Array-based comparative genomic hybridization (aCGH) has provided a new impulse to cytogenetic diagnostics and has proved to be a valuable tool in the clinical management of patients with developmental delays and multiple congenital anomalies [1]. This approach has also allowed the identification of novel chromosomal syndromes [2,3], helped to define the clinical variability associated with several genomic disorders [4], and led to the discovery of polymorphic copy number variants (CNVs) in the human genome [5][6][7].
Early CGH arrays were composed of large-insert bacterial artificial chromosome (BAC) clones and later evolved into oligonucleotide-based microarray CGH (oligo aCGH).
There were concerns about procedure variability and interpretation criteria for the clinical application of early versions of targeted BAC clone arrays because of the presence of false negative results [8]. Two studies attempted to estimate the false positive (FPR) and false negative (FNR) rates.
Wong et al. [9] analyzed six repeated experiments on 95 individuals. Given the very low binomial probability of detecting the same clone twice by chance within six experiments, they assumed that any clone detected twice or more in their experiments was a true CNV [9]. In this way, they calculated an FNR of 45.3% and an FPR of 0.23%. Following a similar method, de Smith et al. [10] estimated the FNR at 0.16. The FPR was instead estimated at 0.05 for multi-probe calls, by using three self-self hybridizations of the reference sample and comparing the average number of variant interval calls with that calculated for each sample. However, a few studies validated BAC clone arrays by examining well ascertained CNVs [11,12].
The more recent development of oligonucleotide aCGH has led to a greater resolution in CNV identification. Concordance of oligo aCGH with BAC arrays was shown [13,14], with around 99% sensitivity and 99% specificity [11,15], and superior performance of oligonucleotide aCGH over BAC clone aCGH was reported [13,16].
However, all these studies based their confirmatory results on imbalances of standard cytogenetic size (Mbs). In addition, high probe density might generate more noise in aCGH data. Using two different Agilent CGH microarrays, it has been shown empirically that subject-to-subject variance is almost twice as large as array-to-array and dye-to-dye variance, supporting the reliability of results. However, the same study showed that, for a custom microarray, array-to-array variability was more than 10 times larger than both subject-to-subject and dye-to-dye variance [17]. It was suggested that this observation might be ascribed to the fact that striving for the highest possible density of coverage may have led to the inclusion of less reliable probes.
The use of oligo arrays, though increasing resolution, has led to lower specificity and a higher potential for noise, with the need for several adjacent probes to confidently identify CNV regions and a large amount of data analysis and result interpretation. Each of the necessary steps of data transformation, normalization and summarization involves algorithm parameters that directly affect the sensitivity and specificity of the aCGH assay and represents a source of potential Type I and Type II errors [18]. The analysis can be further complicated by platform and method artifacts, including GC-waves [19,20], and by centralization methods [21].
More recently, the growing use of custom arrays, which are based on libraries of validated synthetic probes that can interrogate relevant genomic regions, has further enhanced the resolution achievable in targeted regions [22][23][24][25][26]. In addition, single nucleotide polymorphism (SNP) arrays have been exploited to search for CNVs: although not feasible for single-exon resolution throughout the genome, they have the advantage of providing genotypes and detecting regions of absence of heterozygosity (AOH), thus also allowing the identification of uniparental isodisomies (UPD) and genetic identity by descent [27]. However, the amount of data produced by SNP arrays is computationally challenging and requires a burden of analysis and filtering to allow interpretation of the results, and it is not yet well established whether the widely used SNP-array-based CNV calling methods provide sufficient concordance with CGH in CNV detection [28]. In addition, SNP arrays were shown not to outperform oligonucleotide aCGH in a study that carried out a CNV search with the Affymetrix 6.0 SNP array on patients with developmental disorders already found negative by oligo aCGH at higher resolution [29]. Combining SNP and oligo probes in a single assay is increasingly being employed, with the advantage of obtaining genotypes and a higher resolution with respect to aCGH data alone [30]. However, if the final goal is limited to the search for CNVs, oligo aCGH remains the most cost-effective and straightforward method.
Several quality metrics can be used to evaluate the quality of the oligo array and dataset, such as probe-to-probe log2 ratio noise, signal intensity, background noise in the two channels, signal-to-noise ratio and reproducibility. In addition, CNVs are generally claimed only when several probes are indicative of a CNV, although this reduces the array resolution. However, false positive and false negative results remain an issue.
A few studies on preimplantation genetic screening in human assisted reproduction showed a not infrequent presence of false positive and false negative aCGH results [31,32].
In particular, Capalbo et al. [31] compared aCGH and qPCR on 120 aneuploid blastocysts, finding that 18.3% of embryos gave a discordant result for at least one chromosome and that most discordances were due to aCGH false positives.
A small number of other studies mention false positive and false negative results detected, for instance, through the use of different aberration detection methods [24], or the finding of possible aberrations missed by aCGH when reanalyzing the results by Next Generation Sequencing (NGS) [26]. However, in our opinion, the presence of false positive and false negative results is not sufficiently addressed in the literature.
Abnormal results should be confirmed, and various follow-up strategies have been described, including repeated aCGH testing, fluorescence in situ hybridization (FISH), microsatellite analysis, multiplex ligation-dependent probe amplification (MLPA) and, in particular, real-time quantitative PCR (qPCR). Recently, NGS has also been suggested [26].
Nonetheless, aCGH remains the first-tier test for CNV detection, due to its genome-wide applicability. Validation on a large number of patient and control samples following aCGH analysis is not practical, and it is not always reported in large screenings [33,34], or only a few interesting candidates are validated [26].
We took advantage of a custom aCGH, recently carried out on 59 patients affected by Hirschsprung Disease (HSCR) to search for imbalances in genes and loci that are candidates for HSCR [35], to analyze its performance, the use of several aberration detection algorithms, and the presence of false positive and false negative results.

Sample Quality and Design Reliability
DLRSs (derivative log ratio spread) and the other metrics of the final 59 samples analyzed for the search of aberrations are shown in Figure 1. The DLRS, in particular, is a measure of the log ratio noise for each sample, calculated as the standard deviation (spread) of the log ratio differences between consecutive probes.
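As a minimal sketch of the metric just described (following the plain definition given in the text; Agilent's own implementation additionally uses a robust, IQR-based spread), the DLRS can be computed from the ordered probe log2 ratios as:

```python
import numpy as np

def dlrs(log2_ratios):
    """Derivative log ratio spread: the spread of the differences between
    consecutive probe log2 ratios. The division by sqrt(2) converts the
    spread of the pairwise differences back to a per-probe noise estimate
    (the difference of two independent equal-variance values has sqrt(2)
    times their standard deviation)."""
    diffs = np.diff(np.asarray(log2_ratios, dtype=float))
    return diffs.std(ddof=1) / np.sqrt(2)

# A flat profile with Gaussian noise of sd 0.2 should give a DLRS near 0.2.
rng = np.random.default_rng(0)
noisy_profile = rng.normal(0.0, 0.2, 20000)
```

On such a simulated flat profile the estimate falls close to the simulated noise level, i.e., right at the 0.2 cutoff used below to flag excellent samples.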
Figure 1. Quality control metrics: the distribution of the sample quality controls is reported as box plots and as statistics. In particular, sample metrics are highlighted as excellent, good or poor (evaluate), and the number of samples in each category is also reported. Solid circles and asterisks in the box plot graphs represent outliers: solid circles are cases with values more than 1.5 times the interquartile (IQ) range, asterisks are cases with values more than 3 times the IQ range.
To assess the reliability of the results, we correlated the log2 ratios between replicates as reported elsewhere [36], finding low correlations (mean r = 0.18 across 37 comparisons), though higher than among random sample pairs (mean r = 0.07, p = 0.0040) (Table 1). Such a low correlation is not unexpected, since log2 ratios not different from zero are supposed to vary randomly. As a matter of fact, considering only log2 ratios with absolute values above 0.3, the mean correlations improved in both replicated and random sample pairs (mean r = 0.42, p = 1.8 × 10−9 and r = 0.14, p = 0.0036, respectively), but at a much higher rate for the replicates (p = 4.8 × 10−5). These results are reassuring as to the design quality and suggest that the application of a minimum log2 ratio criterion, such as a mean absolute log2 ratio > 0.30 (hereafter referred to as MALR > 0.30), is reasonable and advisable. Not surprisingly, the level of correlation depends on the derivative log ratio spread (DLRS): correlations between replicates and random pairs were significantly different only when at least one sample had DLRS ≤ 0.2 (p = 6.8 × 10−6 on |log2 ratio| ≥ 0.3). Correlations were much lower and not significant for the 13 pairs in which both samples had DLRS > 0.2 (r = 0.21 for replicates and r = 0.09 for random pairs). Of note, although we could not find any correlation between the DLRS and the log2 ratio values, besides a faint negative trend, correlations between pairs in which at least one sample had DLRS ≤ 0.2 were significantly higher than correlations between pairs without any sample with DLRS ≤ 0.2, for replicates but not for random pairs (i.e., r = 0.53 vs. 0.21, p = 0.0009 for replicates and r = 0.17 vs. 0.09, p = 0.2188 for random pairs on |log2 ratio| ≥ 0.3) (Table 1).
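The effect of the minimum |log2 ratio| cutoff on replicate concordance can be illustrated with a short sketch on simulated data (the rule of keeping probes where either array exceeds the cutoff is our reading; the text does not spell out exactly how the restriction was applied):

```python
import numpy as np

def replicate_correlation(log2_a, log2_b, min_abs=None):
    """Pearson correlation between two hybridizations of the same sample.
    If min_abs is given, only probes where at least one of the two arrays
    has |log2 ratio| >= min_abs are kept, discarding near-zero ratios
    that are expected to vary randomly and dilute the correlation."""
    a = np.asarray(log2_a, dtype=float)
    b = np.asarray(log2_b, dtype=float)
    if min_abs is not None:
        keep = (np.abs(a) >= min_abs) | (np.abs(b) >= min_abs)
        a, b = a[keep], b[keep]
    return np.corrcoef(a, b)[0, 1]

# Simulated pair of replicates: a shared 100-probe deletion plus noise.
rng = np.random.default_rng(1)
truth = np.zeros(2000)
truth[100:200] = -1.0
rep1 = truth + rng.normal(0, 0.15, truth.size)
rep2 = truth + rng.normal(0, 0.15, truth.size)
```

On this toy pair, restricting the comparison to |log2 ratio| ≥ 0.3 raises the correlation, mirroring the improvement reported above for the real replicates.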
Following the method described elsewhere [10], we estimated the FPR for the Aberration Detection Method 2 (ADM-2) algorithm at threshold 6 to be 0.130 for multi-probe calls and 0.184 for single-probe calls, higher than that estimated by de Smith et al. [10]. However, all the aberrations called in the three self-self tests involved the high density region around RET (10q11.2), confirming that this sub-centromeric region is problematic, and all presented a MALR < 0.3, so that after applying such a filter the FPR virtually dropped to 0. Of note, none of the calls on this gene, including those that seemed likely or possible, was confirmed at validation.
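A sketch of this self-self-based FPR estimate (our reading of the method of de Smith et al. [10]: every call made when the reference is hybridized against itself is by construction a false positive, so the ratio of average self-self calls to average per-sample calls approximates the FPR; the function name and the toy counts are ours):

```python
def estimate_fpr(self_self_call_counts, sample_call_counts):
    """Estimate the false positive rate from self-self hybridizations.
    Any interval called when a sample is compared against itself is by
    construction a false positive; the FPR is approximated as the mean
    self-self call count over the mean per-sample call count."""
    mean_false = sum(self_self_call_counts) / len(self_self_call_counts)
    mean_total = sum(sample_call_counts) / len(sample_call_counts)
    return mean_false / mean_total

# Hypothetical counts: 3 self-self runs vs. calls in 5 patient samples.
fpr = estimate_fpr([1, 2, 3], [10, 12, 8, 10, 10])
```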
We also estimated the FNR in a manner similar to that described by Wong et al. and de Smith et al. [9,10]. In the four replicated experiments, 13 putative variant intervals were observed at least twice and were then considered true calls, yielding an estimated FNR of 0.50. We also evaluated three samples assayed in triplicate, finding an average FNR of 0.213. The FNR found in the four-replicate experiments was slightly higher than that estimated by Wong et al. [9] and definitely higher than that estimated by de Smith et al. [10], while the estimate from the triplicate experiments was closer to theirs. However, we calculated that if Wong et al. [9] had performed four (or three) replicates, they would have estimated an FNR of 0.3629 (or 0.2111, respectively), similar to our estimates.
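The recurrence-based FNR estimate can be sketched as follows (a simplified reconstruction under the stated assumption: intervals called in at least two replicates are taken as true, and the FNR is the fraction of replicate runs in which those assumed-true intervals were missed):

```python
from collections import Counter

def estimate_fnr(replicate_calls, min_recurrence=2):
    """Wong/de Smith-style FNR estimate from k replicate experiments.
    replicate_calls: one set of called interval identifiers per replicate.
    Intervals seen in >= min_recurrence replicates are assumed true; the
    FNR is the fraction of missed detections of those intervals across
    all replicates."""
    counts = Counter()
    for calls in replicate_calls:
        counts.update(set(calls))
    k = len(replicate_calls)
    true_calls = [c for c, n in counts.items() if n >= min_recurrence]
    if not true_calls:
        return None
    detected = sum(counts[c] for c in true_calls)
    return 1 - detected / (len(true_calls) * k)

# Toy example: two intervals each seen in 2 of 4 replicates -> FNR 0.5,
# the same arithmetic that underlies the estimate reported in the text.
fnr = estimate_fnr([{"a", "b"}, {"a"}, {"b"}, set()])
```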
The studies by both de Smith et al. [10] and Wong et al. [9] made use of BAC-based CGH, so a direct comparison with our results might be misleading. Nevertheless, a high FNR (above 20%), with calls missed in one or more replicates, and a very low FPR were confirmed. This latter estimate is based on a reference DNA of very high quality (DLRS < 0.15), and we cannot exclude that the FPR would have been higher with worse quality samples. Accordingly, a few imbalances could not be validated or confirmed in replicates.

High Density Design Performance
Notwithstanding the possibility that high density regions might show low replicability and worse profiles than lower density regions, when zooming in on these regions we did not observe any higher variability in the profiles of high density probe regions compared with the rest of the genome (Figure 2). Accordingly, we found that the number of calls in the selected regions correlated with the number of probes analyzed (r = 0.773), as expected, while it did not correlate with either the size or the probe density of the region selected. These observations were confirmed when considering only calls sustained by at least two probes, or when considering the number of probes called instead of the number of calls. When only calls with MALR > 0.3 were considered, no correlation could be detected at all (Figure 3). Redundancy of probes covering the same target sequence might instead create problems in the hybridization and thus false results, as we observed in a preliminary design.

Comparison between Algorithms and Filters
Applying the ADM-2 algorithm with a threshold of 6, together with the GC correction and centralization algorithms, and excluding the positive control regions, we obtained 572 aberration calls on autosomal chromosomes (a median of seven calls per sample, range 1-34), 393 of which were sustained by at least two consecutive probes (a median of five calls per sample, range 1-17). We also repeated the analysis without the GC correction, but no difference could be detected in the results.
However, on visual inspection of the sample profiles, most of the calls appeared to us to be false positives; for this reason we also applied two additional filters, MALR > 0.30 and threshold e > 0.33. With the first filter we obtained 75 calls, six of which corresponded to the already known chromosomal alterations (for these latter we actually got nine calls, but five overlapped the two chromosomal RET deletions in the two samples used as controls for this region and are not reported in Table 2). Two CNVs were not found in the best quality sample firstly evaluated but were added to the list because they were found in the two additional good quality replicates. These 77 aberrations reduced to 52 when considering threshold e > 0.33, an empirical threshold based on the number of probes falling in specific log2 ratio ranges that we applied to the present data (see the Methods) (Table 2).
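The threshold e filter lends itself to a compact sketch (the cutoffs of +0.5 for gains and −0.8 for losses and the one-third fraction are taken from this dataset; the function name and interface are ours):

```python
import numpy as np

def passes_threshold_e(probe_log2, direction, frac=1/3,
                       gain_cut=0.5, loss_cut=-0.8):
    """Empirical per-interval filter: retain a called gain (loss) only if
    more than `frac` of its probes have a log2 ratio at or above
    `gain_cut` (at or below `loss_cut`)."""
    x = np.asarray(probe_log2, dtype=float)
    if direction == "gain":
        supporting = np.mean(x >= gain_cut)
    elif direction == "loss":
        supporting = np.mean(x <= loss_cut)
    else:
        raise ValueError("direction must be 'gain' or 'loss'")
    return bool(supporting > frac)
```

For instance, a putative gain in which only one probe out of six reaches +0.5 would be filtered out, while a loss with most probes below −0.8 is retained.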

Table 2 legend: yes = validated and/or confirmed on replicate(s); no = not validated and/or not confirmed on replicate(s); known = selected controls or known chromosomal rearrangements; unknown = not possible to discriminate between yes and no; * probes not located in the selected high density regions; § aberration not detected by the software call, but identified by visual inspection; Y = percentage of probes with high absolute log2 ratio (≥0.5 for gains and ≤−0.8 for losses) above 33.3%; N = percentage ≤ 33.3%; - = not called by the algorithm; ‡ deletion reported as a CNV of unknown pathogenicity in an individual with aganglionic megacolon (another name for HSCR), intellectual disability and short stature; §§ aberrations assumed as detected because identified in two additional replicates.
Forty-two calls were detected with the same two filters when applying the Fuzzy Zero algorithm, while 22 aberrations were called when applying the ADM-2 algorithm with a more stringent threshold of 8.
The visual inspection of the sample profiles allowed us to add six calls to the list of aberrations, for a total of 83 aberrations in 61 different chromosomal locations (including controls) in 44 samples.
Excluding aberrations previously reported in DGV and the controls, 51 aberrations were detected in 25 patients, two of which were each found in three patients. The called variants reduced to 25 aberrations in 17 patients when applying threshold e > 0.33, to 15 aberrations in nine patients when applying the more stringent threshold of 8, and to 24 aberrations in 16 patients when applying the Fuzzy Zero algorithm (Table 2).
For 15 of the novel aberrations we had at least one replicate sample available, although for two samples the replicate was of low quality (DLRS ≥ 0.3). While one replicate was inconclusive because of a too-noisy profile (a low quality replicate), six were definitely not replicated, four were likely replicated although not called by the software, and four were clearly replicated, although two showed a different size.

Software Algorithms Calls and Visual Inspection
The visual evaluation of the log2 ratio sample profiles allowed the identification of additional putative aberrations not detected by the software, pointing to a substantial chance of false negative results. Of the 44 CNVs classified as likely or possible based on visual inspection and further verified, 39 turned out to be true (confirmed at validation, detected also on a second replicate, or reported in DGV and thus assumed to be true, in addition to the six known aberrations, all very clearly visible). Among those classified as unlikely, instead, only four could be claimed as true, while eight were excluded at validation (p = 0.0004), suggesting that visual inspection of the sample profile is crucial, and even more reliable than the use of algorithms for variant calling (Table 3). Of note, comparing the visual inspection results with the software calls under various scenarios (a more stringent threshold of 8 for the ADM-2 algorithm, the application of the Fuzzy Zero algorithm, and the threshold e > 0.33 filter), all under the MALR > 0.3 filter with at least two probes, the ADM-2 algorithm with threshold 6 combined with the threshold e > 0.33 filter proved the most comparable with visual inspection, and threshold e > 0.33 in general proved a good discriminating filter (Table 3). Both the ADM-2 algorithm with a threshold of 8 and the Fuzzy Zero algorithm missed too many true calls (25 and 15, respectively).
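The contrast between "likely/possible" and "unlikely" calls can be checked with a Fisher exact test. The 2×2 table below is our reconstruction from the counts in the text (39 of 44 verified likely/possible calls true, hence 5 assumed not true; 4 true vs. 8 excluded among unlikely), so the resulting p-value need not match the authors' figure exactly:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    sums the hypergeometric probabilities of all tables with the same
    margins whose probability does not exceed the observed table's."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def p_table(x):  # P(top-left cell = x) with margins fixed
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = p_table(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# likely/possible: 39 true, 5 not true; unlikely: 4 true, 8 not true.
p_value = fisher_exact_two_sided(39, 5, 4, 8)
```

The association is strong (p well below 0.01), consistent with the conclusion that the expert-eye classification discriminates true from false calls.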

Discussion
The search for CNVs in genes and loci that are candidates for HSCR in a panel of affected individuals has provided the opportunity to investigate in more detail the quality of our custom aCGH design and to draw some general remarks. First of all, we could confirm that the use of a high density design does not seem to increase the error in variant detection, nor lead to worse sample profiles, thus confirming the validity of this strategy for searching for small imbalances, otherwise undetectable.
The size of the imbalances that can be detected depends upon the density of the probes targeting the regions of interest and the criteria set for software-generated calls (i.e., a minimum of two probes, as applied here). Redundancy of probes covering the same target sequence might create problems in the hybridization and thus false results. In addition, not all probes perform equally well. However, the selection of probes and the density of coverage of the array, which in our design was up to one probe every 250 nt (for RET), might greatly increase the resolution of the CNV search, provided that there is no probe overlap. We could detect a few imbalances smaller than 5 kb that were successfully confirmed by other techniques.
Not surprisingly, the quality of the samples was confirmed to be crucial. There is no correlation between the number of calls detected by the software algorithm and the DLRS of the samples, as expected, since the algorithm already takes the sample quality into account. However, the log2 ratio correlation between replicates was greatly improved when DLRS values were excellent (≤0.2). Of note, among calls with MALR > 0.3, those detected in samples with excellent DLRS were classified as "likely" or "possible" more frequently than those detected in samples with worse DLRS. "Likely" and "possible" variants could be grouped, although "likely" variants turned out to be true more often than "possible" variants (30 vs. 4, compared to 4 vs. 1).
Above all, our study highlighted the importance of visual inspection by an expert eye. The human eye can take into account several factors, such as the general profile of the sample, the specific region profile, and the single log2 ratio values involved in the putative aberration, qualitatively better than any algorithm. Of course, in the case of a large screening, the use of algorithms becomes a pivotal tool. In this regard, we noticed that the ADM-2 algorithm with the threshold of 6 suggested by Agilent performed quite well, provided that a minimum absolute log2 ratio for the region is taken into account, such as MALR > 0.30, as already applied in several studies. In addition, we suggest that the number of concordant probes should also be considered when calling an imbalance. For instance, we applied an empirical filter that evaluated whether at least one third of the probes were above (for gains) or below (for losses) a certain cutoff value, here chosen to be +0.5 and −0.8, finding that the ability to discriminate between true and false calls was greatly improved, especially with the ADM-2 detection algorithm.
A more stringent threshold value for the ADM-2 algorithm did not seem to improve the detection of true imbalances, while, conversely, it risked missing several possibly true variants. The same can be said for the Fuzzy Zero algorithm: it is useful for large regions, which, however, are easily recognizable as false positives also by visual inspection of the sample profiles.
Very evident large calls, clearly visible at visual inspection and with a neat discrimination from the baseline, were identified with a high level of confidence. They were also easily reproduced in replicates, including very low quality replicates. This happened, for instance, for the loss and gain CNVs we included as controls and for two Down syndrome samples. Similarly, the first studies that investigated the reliability of aCGH were mostly based on this sort of gold standard and achieved complete or almost complete concordance. It remains to be explained why other, smaller calls seem to be less evident. One possibility is that smaller calls rely on a small number of probes and are therefore more dependent on local probe quality. In addition, problematic regions (such as those close to centromeres) showed variability: the RET region, known to be difficult to amplify and screen, was the one showing the most false results. Finally, mosaicism cannot be excluded as an explanation for this observation.
It has been suggested that an additional source of variability is bench bias [17]. We did not investigate this matter here, but from our observations we cannot exclude that correlation and concordance in aberration calls depend on the array and on the time at which the experiment was carried out.
Finally, it has also been shown that the fluorescent dyes commonly used in array CGH, fundamentally the red dyes, are sensitive to ozone, and that ozone has a very strong effect on array data, especially during the post-hybridization step [37]. We did not specifically address the possibility of false results in this context; however, the arrays were washed and dried under laminar flow hoods in a semi-dark environment to reduce ozone exposure. Accordingly, we did not observe a difference in the gain/loss ratio among the different groups of CNVs (likely, possible, unlikely, true or not true), with the exception of the variants defined as unlikely, basically due to CNVs bigger than 5 kb. The four unlikely true variants (thus possible false negatives) were all gains (red dye predominant) bigger than 5 kb, while among the eight not true variants (possible false positives), the three variants smaller than 5 kb were all gains, and the five variants bigger than 5 kb were all losses. Thus, we cannot exclude that ozone affected our data, but if this is the case, it seems to affect the smaller aberrations less.
In accordance with the impression we got from visual inspection, we estimated high false positive and false negative rates for the software calls, though these figures need to be taken cautiously. Indeed, we had some concerns given that a few of the variants detected were later excluded by alternative methods (such as three unlikely and one likely losses and gains excluded by qPCR) and, on the other hand, we were able to detect variants missed by the software that were subsequently confirmed to be true (such as three likely losses and gains confirmed by qPCR). Similarly, a deletion found in a sample for which we had two additional replicates was also detected in the bad quality replicate but not in the good one. In addition, we found two true aberrations in a sample assayed in triplicate that were not detected in the best quality replicate and would thus have gone undetected.
Therefore, our observations strengthen the need to validate the results by means of other techniques, among which qPCR is particularly recommended. The a priori use of a combined oligo aCGH and SNP array might also have avoided some false results. Supposed deletions could have been excluded based on heterozygous genotypes in the same region, for instance. However, the non-uniform distribution of informative SNPs throughout the genome might nullify this advantage in specific regions, especially in the case of small CNVs.
Most of the observations reported here come from visual classification, and thus lack a proven gold standard. Nevertheless, our study confirms that the high density design of aCGH does not generate more noise than lower density designs and, in addition, does reach a better resolution, with the finding of validated imbalances smaller than 5 kb. In conclusion, our design proved to be reliable, provided that some filters are applied, such as MALR > 0.3. We also propose an additional filter, threshold e > 0.33, which takes into account the proportion of probes with log2 ratios exceeding values suggestive of gain or loss. In any case, besides the use of additional filters, we would like to stress the importance of paying great attention to the observation of the sample profiles and the necessity of validating the imbalances detected.

Microarray Design
The sample consisted of 59 Italian sporadic HSCR patients, six of whom carried known chromosomal aberrations at the karyotype level: three chromosome 21 trisomies, one invdup(22)(q11), and two interstitial deletions in 10q11.21. The clinical features, selection, and processing of the samples are described elsewhere [35].
We have designed a high-density custom array (8X15K SurePrint G3 Human Kit, Agilent Technologies, Santa Clara, CA, USA) through the Agilent eArray web portal. Genomic DNA (test) and sex-matched controls (Promega, Madison, WI, USA) were labelled and hybridized following the protocols provided by the manufacturers. Spot intensities were processed by Agilent Feature Extraction software and the text file outputs were imported into Agilent Genomic Workbench v. 5.0.14 software (Agilent Technologies, Santa Clara, CA, USA) distributed by the vendor.
The microarray consisted of 8333 probes at high density, selected to cover 20 HSCR candidate genes as described elsewhere, and 3130 probes scattered along the genome, at a density of around 1 probe every 900 kb, which constituted the backbone, together with 1262 normalization probes (13 of which were located in the selected regions), 301 probes replicated five times (1505 probes), and 1482 control probes provided by Agilent (Table 4). * Twenty-two probes selected for the high density panel were also included in the normalization set or in the replicates set; they are not reported among the # of unique probes selected, but are counted in the average coverage. Nineteen probes selected in the rest of the genome had already been selected for the high density regions (10) or were already part of the normalization set (9).
Gene and locus positions are based on the Human Genome GRCh37 (hg19) assembly of UCSC genome browser [38].

Data Analysis and Structural Variant Detection
To investigate genomic imbalances, we applied the ADM-2 algorithm. A threshold of 6 was set, as recommended by Agilent, and a more stringent threshold of 8 was also tried.
We applied the centralization algorithm and the GC correction algorithm, and we repeated the aberration detection call both applying and not applying the Fuzzy Zero algorithm.
Finally, we considered as aberrant only those regions with a minimum of 2 probes and a minimum absolute average log2 ratio > 0.3. In addition, we evaluated the effect of an additional criterion, threshold e > 0.33: at least one third of the probes in the putative imbalanced interval (at least 2 probes in the case of aberrations based on 3 probes) must present a log2 ratio above 0.5 for gains or below −0.8 for losses (based on log2(3/2) = 0.58 and log2(1/2) = −1).
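The threshold e criterion described above can be sketched in a few lines of code. The helper below is a minimal illustration under our own naming conventions; it is not part of the Agilent software.

```python
# Minimal sketch of the proposed "threshold e" filter (illustrative helper,
# not Agilent code). A called interval is retained only if at least one third
# of its probes (at least 2 of 3 for three-probe calls) show log2 ratios
# beyond suggestive single-copy values: > 0.5 for gains, < -0.8 for losses.

def passes_threshold_e(log2_ratios, kind, e=1/3):
    """log2_ratios: per-probe log2 ratios of the called interval.
    kind: "gain" or "loss"."""
    n = len(log2_ratios)
    if kind == "gain":
        supportive = sum(1 for r in log2_ratios if r > 0.5)
    else:
        supportive = sum(1 for r in log2_ratios if r < -0.8)
    required = 2 if n == 3 else e * n  # special case for three-probe calls
    return supportive >= required
```

For example, a six-probe putative gain in which only one probe exceeds 0.5 would be discarded by this filter, even if its average log2 ratio passes the MALR cutoff.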
We also evaluated the sample profiles by visual inspection, reviewed by a second well-trained operator. Nearby gain or loss intervals at a locus were considered two separate CNVs if the intervening region spanned more than 2 probes, or if they differed for up to 2 probes presenting inconsistent log2 ratios (i.e., in the opposite direction: log2 ratio < −0.3 within a gain or > 0.3 within a loss).
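The split/merge rule above can be expressed as a small decision helper; the function name and input format are our own illustration, not a published implementation.

```python
# Illustrative helper (hypothetical naming): decide whether two nearby
# same-direction intervals belong to the same CNV. They are split into two
# CNVs when more than 2 probes lie between them, or when any intervening
# probe points in the opposite direction (log2 ratio < -0.3 inside a gain,
# > 0.3 inside a loss).

def same_cnv(kind, intervening_log2_ratios):
    if len(intervening_log2_ratios) > 2:
        return False
    if kind == "gain":
        return all(r >= -0.3 for r in intervening_log2_ratios)
    return all(r <= 0.3 for r in intervening_log2_ratios)
```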

Statistical Analysis
The Agilent Feature Extraction (FE) software processes the data, calculates signal log2 ratios, estimates errors, and provides basic QC metrics. In particular, we evaluated: (i) the DLRSpread (derivative log2 ratio spread, DLRS), a measure of the log2 ratio noise for each sample; (ii) the BGNoise (background noise), a measure of the background fluorescence for each channel (red and green); (iii) the signal intensity; (iv) the signal to noise ratio; and (v) the reproducibility for each channel. Measures were classified as excellent, good, or to be evaluated, based on Agilent's guidelines (Figure 1).
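The DLRSpread metric can be approximated in a few lines. The sketch below assumes the commonly used definition, a robust (IQR-based) standard deviation of probe-to-probe log2 ratio differences scaled by sqrt(2), which may differ in detail from the Feature Extraction implementation.

```python
import math

# Approximate DLRSpread: robust spread of consecutive-probe log2 ratio
# differences. Dividing by sqrt(2) converts the noise of a difference of two
# probes back to per-probe noise; 1.349 is the IQR of a standard normal.
def dlrs(log2_ratios):
    diffs = sorted(b - a for a, b in zip(log2_ratios, log2_ratios[1:]))
    n = len(diffs)
    q1 = diffs[int(0.25 * (n - 1))]
    q3 = diffs[int(0.75 * (n - 1))]
    return (q3 - q1) / (1.349 * math.sqrt(2))
```

Because it works on differences between adjacent probes, the metric is insensitive to genuine copy-number steps, which affect only a few consecutive differences, while random probe-to-probe noise inflates it.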
Newly extracted or purified DNA was run on an additional array for 10 samples with bad profiles (i.e., DLRS ≥ 0.3) and for 16 arbitrarily selected samples. Moreover, four samples were replicated three times (two of which had a third replicate of bad quality), and one sample was replicated four times. When multiple replicates were available, the variant search was performed on the sample with the lowest DLRS or, in case of very similar DLRS among replicates, on the one with an overall better quality. In any case, detected aberrations were also evaluated in the replicated samples, when available.
The FPR was determined by comparing the average number of variant calls in self-self tests with the average number of variant calls per sample, using three self-self hybridizations of a reference sample, as described by de Smith et al. [10]. We estimated the FNR using replicated experiments as described by Wong et al. and de Smith et al. [9,10], adjusting their BAC array based estimates by substituting the number of clones with the number of probes. While they based their estimates on 6 and 4 replicates, respectively, we used both four and three replicates. Of note, had Wong et al. [9] performed only 4 replicates, some calls detected more than once would have been lost (those virtually present only in the two additional replicates no longer available). For this reason, we calculated by permutation the expected number of calls replicated more than twice by Wong et al. [9] in the case of four (or three) replicates, assuming a random distribution across the replicate experiments. We considered aberrant intervals revealed in different experiments as identical if the overlap among probes was ≥0.90.
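The permutation described above can be sketched as follows. The replicate counts used in the test cases are hypothetical, not the published data, and the helper name is our own.

```python
import random

# Monte Carlo sketch: given how many of `total` replicates each call appeared
# in, estimate how many calls would still be seen more than twice if only
# `kept` replicates had been run, assuming calls fall on replicates at random.
def expected_calls_over_twice(per_call_counts, total=6, kept=4, trials=5000, seed=0):
    rng = random.Random(seed)
    acc = 0
    for _ in range(trials):
        chosen = set(rng.sample(range(total), kept))
        for k in per_call_counts:  # this call was seen in k of `total` replicates
            hits = len(chosen & set(rng.sample(range(total), k)))
            if hits > 2:
                acc += 1
    return acc / trials
```

A call seen in all six original replicates always survives (four hits in any four kept replicates), while a call seen only once can never reach three hits and always drops out.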
We also calculated the correlation between replicates and between random sample pairs selected to have similar DLRS (<0.2, ≥0.2 and <0.3, or ≥0.3). We repeated the analysis both for all log2 ratios and for log2 ratios exceeding an absolute value of 0.3 (considering that those not different from 0 are not expected to correlate at all), and both among all the replicated samples and among only those with DLRS below the two cutoffs of 0.3 and 0.2. Moreover, we investigated whether the number of aberrations called by the software correlated with the number of probes called, the size, or the probe density, excluding from the analysis the aberrations used as controls and assigning to the high density group the aberrations that overlapped both the high density probe regions and the rest of the genome. We assumed the aberration size to be the mean between the inner and the outer probes called.
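The replicate correlation amounts to a plain Pearson coefficient, optionally restricted to probes exceeding the |0.3| cutoff; the helper below is a minimal sketch of this computation.

```python
# Pearson correlation between two replicates' per-probe log2 ratios.
# With min_abs set (e.g. 0.3), only probe pairs where either replicate
# exceeds the cutoff in absolute value are kept, since near-zero ratios
# carry no copy-number signal and are not expected to correlate.
def replicate_correlation(a, b, min_abs=None):
    pairs = [(x, y) for x, y in zip(a, b)
             if min_abs is None or abs(x) >= min_abs or abs(y) >= min_abs]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x, _ in pairs) ** 0.5
    sy = sum((y - my) ** 2 for _, y in pairs) ** 0.5
    return cov / (sx * sy)
```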
Finally, we tried to investigate whether there was an association between the calls obtained with the software under different criteria and the visual inspection results.

Detected Variants Classification and Validation
The visual inspection of the aCGH sample profiles allowed us to classify the detected aberrations as known (the controls), likely, possible but not convincing (possible), or unlikely (Table 2).
Aberrations were compared with CNVs observed in the normal population and reported in the Database of Genomic Variants (DGV) [39], and with the CNVs reported in the DECIPHER database of phenotypes, release v8.7 [40]. The comparison between different platforms and techniques is tricky: the exact boundaries of array-detected aberrations are not known, but only assumed to lie between the last "normal" probe (outer) and the first "aberrant" probe (inner), and they depend on the average coverage. However, we considered aberrations consistent with those already reported in the databases if they showed an overlap ≥80%, did not differ for more than two probes with compatible log2 ratios (that is, ≥|0.3|), and were of the same kind (gain or loss). The frequency of the CNV, or the number of individuals in the database carrying the variant, was not a selection criterion, but is reported.
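The consistency criterion can be made concrete with a small helper. We assume here that the ≥80% overlap is reciprocal, which is the usual convention for CNV comparison, although the text above does not specify it; the probe-count condition is omitted for brevity.

```python
# Illustrative check of call/database CNV consistency: same kind (gain or
# loss) and reciprocal overlap of at least `min_overlap` of each interval.
def is_consistent(call, db_cnv, min_overlap=0.8):
    """call, db_cnv: (start, end, kind) tuples with kind 'gain' or 'loss'."""
    s1, e1, k1 = call
    s2, e2, k2 = db_cnv
    if k1 != k2:
        return False
    inter = max(0, min(e1, e2) - max(s1, s2))
    return inter / (e1 - s1) >= min_overlap and inter / (e2 - s2) >= min_overlap
```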
We arbitrarily selected the most promising regions (those classified as likely and not reported on the DGV database) and those more interesting for us (i.e., on the RET gene) for validation with other molecular biology techniques, and parental check, as described elsewhere [35] and reported in Table 2.
We considered as true calls those detected in the controls (already assessed with other techniques, or trisomy 21 in patients also affected by Down syndrome), the aberrations confirmed at validation, the calls also detected on a second replicate, and the CNVs reported in DGV.

Conclusions
Our study confirms that the high density design of aCGH does not generate more noise than lower density designs and, in addition, reaches a better resolution. However, false positive and false negative results are not trivial. For this reason, we suggest applying some filters, such as MALR ≥ 0.3 and threshold e > 0.33, the latter taking into account the proportion of probes with log2 ratios exceeding suggestive values for gain or loss. We have also shown the importance of visual inspection of the results and the necessity of validating the imbalances detected.