Copy number variant detection with low-coverage whole-genome sequencing is a viable replacement for traditional array-CGH

Copy number variations (CNVs) are a type of structural variant involving alterations in the number of copies of specific regions of DNA, which can be either deleted or duplicated. CNVs contribute substantially to normal population variability; however, abnormal CNVs cause numerous genetic disorders. Several methods for CNV detection are currently in use, from conventional cytogenetic analysis through microarray-based methods (aCGH) to next-generation sequencing (NGS). We present GenomeScreen, an NGS-based CNV detection method built on a previously described CNV detection algorithm used for non-invasive prenatal testing (NIPT). We determined the theoretical limits of its accuracy and confirmed them with an extensive in-silico study and with previously genotyped samples. Theoretically, at least 6M uniquely mapped reads are required to detect a CNV of 100 kilobases (kb) or longer with high confidence (Z-score > 7). In practice, the in-silico analysis showed that at least 8M reads are required to obtain >99% accuracy for 100 kb deviations. We compared GenomeScreen with one of the aCGH methods currently used in diagnostic laboratories, which has a 200 kb mean resolution. GenomeScreen and aCGH both detected 59 deviations, and GenomeScreen additionally detected 134 other, usually smaller, variations. Furthermore, the overall cost per sample is about 2-3x lower for GenomeScreen.


Introduction
Copy number variations (CNVs) are a phenomenon in which sections of the genome are repeated, with the number of repeats varying between individuals. CNVs contribute substantially to normal population variability; however, abnormal CNVs are a cause of numerous genetic disorders. Several methods for CNV analysis are in use, from conventional cytogenetic analysis through microarray-based approaches to next-generation sequencing (NGS) [1].
Array-based comparative genomic hybridization (aCGH) provides genome-wide coverage at great resolution, even on the scale of tens of kilobases (10-25 kb) [2]. This promoted aCGH to the gold standard in CNV detection for several years [3], despite some limitations in resolution and accuracy [4].
In contrast, NGS provides a sensitive and accurate approach for the detection of the major types of genomic variations, including CNVs [5,6]. There are three basic strategies for NGS-based CNV analysis: whole-genome, whole-exome, and targeted sequencing. Whole-genome sequencing allows even base-pair resolution of breakpoints with a sophisticated bioinformatics pipeline [7]. Whole-exome and targeted sequencing, on the other hand, aim to reduce the sequencing cost, but they are limited to certain regions (protein-coding or custom) where most known disease-causing mutations occur [8].
We present GenomeScreen, a low-coverage whole-genome NGS-based CNV detection method (built on the previously published NIPT CNV detection method [9,10]), and estimate its accuracy in theoretical and in-silico settings. Furthermore, we compare its sensitivity with the more conventional aCGH method on 106 laboratory-prepared clinical samples.

Theoretical minimal read count
The theoretical minimum number of reads n needed to predict a variation of length l_c with a desired Z-score Z is estimated as (see Section 4.2):

n ≥ 4Z²(1 − p)/p, where p = l_c/l_g and l_g is the whole-genome length.

Standardly, a Z-score of 4 is used in the detection of whole-chromosome aneuploidies [11,12]; however, there are inherently many more possible CNVs than whole-chromosome aneuploidies, so the desired Z-score should be much higher in this instance to decrease the number of false positives. Moreover, in practice, the number of reads needed would be even larger due to the uncertainty of sequencing, mapping, and inherent biological biases [13,14]. The theoretical minimal read count estimation for different Z-scores is shown in Figure 1.
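This bound is straightforward to evaluate; a minimal Python sketch follows (the mappable genome length is an illustrative assumption, not a value taken from the pipeline):

import math

GENOME_LENGTH = 3_100_000_000  # assumed whole-genome length l_g in bp (illustrative)

def min_reads(cnv_length_bp: int, z: float, genome_length: int = GENOME_LENGTH) -> float:
    """Theoretical minimum of uniquely mapped reads to detect a CNV of the given length."""
    p = cnv_length_bp / genome_length  # probability that a read falls into the CNV region
    return 4 * z ** 2 * (1 - p) / p

print(f"{min_reads(100_000, z=7):,.0f}")  # ~6.1M reads for a 100 kb CNV at Z = 7

Evaluated for a 100 kb CNV at Z = 7, the sketch reproduces the roughly 6M reads quoted above.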

Figure 1.
Theoretical minimal read count for successful detection of a CNV of the specified length. Different lines represent different Z-score confidence levels.

Detection accuracy for variable CNV lengths and read counts (in-silico)
To verify the theoretically estimated limitations, we first conducted a simulated in-silico experiment. Artificial samples with simulated CNVs were created from healthy samples by multiplying the bins corresponding to a simulated region placed randomly on the genome. Only regions that did not span into filtered positions were kept for further analysis (about 85% of the genome). The details can be found in Section 4.3.
The in-silico analysis shows the influence of read count and CNV length on prediction accuracy (Figure 2). Based on the findings, we recommend a read count of at least 8M to achieve >99% prediction accuracy for variations of 100 kb and more. We thus recommend following the line for a Z-score of 8 (red in Figure 1) when estimating read counts for different CNV lengths.
Figure 2.
Prediction accuracy computed with the in-silico analysis, based on the length of variation and read count. Each cell is generated from 8,300 simulations (100 randomly generated aberrations, 83 samples).

Evaluation of clinical samples
Finally, we evaluated samples previously analyzed in a diagnostic setting using an aCGH method (Human Genome CGH Microarray 4x44K, Agilent [15]) and GenomeScreen. The chosen aCGH method has 42,494 probes, which results in a mean detection resolution of approximately 200 kb; however, the probes are focused mainly on gene regions and only very sparsely cover intergenic regions, so the resolution is better within genes and worse outside of them.
Of the 106 tested samples, 58 did not show any detection on aCGH, and the remaining samples together contained 59 detections (lengths from 39 kb to 146 Mb), all of which GenomeScreen also detected. The detections on GenomeScreen and on aCGH show excellent concordance, with a median 94.37% overlap (more data in Supplementary material Table 1). GenomeScreen furthermore detected 134 additional variations with lengths from 80 kb up to 1.48 Mb, mainly in regions with few aCGH probes and protein-coding genes, where aCGH has low coverage (Figure 3 and Supplementary material Table 1 and Figure 1).

Discussion
According to the presented results, our method is currently able to detect all tested and almost all simulated variations longer than 100 kb in mappable regions of the human genome; thus, it can replace the more conventional aCGH method. The detection accuracy of GenomeScreen depends heavily on the variation length, the read count, and the genomic position of the variation. The variation length and position cannot be influenced, but they can be partly compensated for by increasing the read count, which inevitably comes with a higher operational cost.
A disadvantage of the binning approach used here is that the variation location is always reported as a multiple of the bin size (20 kb). The aCGH method, on the other hand, uses probes, which can be seen as variable-size bins whose resolution equals the probe distance (which is sometimes larger than the 20 kb bin size). Moreover, the bin size of GenomeScreen can be lowered to increase the precision of the variation location, at the cost of deeper sequencing.
The false-positive rate of GenomeScreen is not studied in this work and should be adequately addressed in the future. However, the loss or gain of a (non-mosaic) deviation with a length of at least 100 kb is so substantial that we do not expect to see any false-positive detections.
In our setting, the GenomeScreen method is 2-3x cheaper than aCGH, while the detection accuracy for variations larger than 100 kb is comparable or even better for GenomeScreen [1]. The aCGH method places probes more densely inside genes, so its accuracy would be higher there and lower outside of genes. In practice, we observed more variations detected by the GenomeScreen method, mainly in intergenic regions. However, further research is needed to correctly assess the prediction accuracy of both methods for variation lengths between 50 kb and 100 kb and for different genomic regions.

Sample collection and processing
Samples of chorionic villi, amniotic fluid, placenta, tissue, or peripheral blood were obtained from 106 patients in the clinical sample group and from 789 in the training group. All patients signed written consent for participation in the research. Peripheral blood was drawn into EDTA or STRECK tubes, inverted several times after collection, stored chilled (4-10°C) for EDTA tubes and at room temperature for STRECK tubes, and transported to the laboratory within 36 hours. DNA was extracted from 200 µl of whole blood or 700 µl of amniotic fluid using the QIAamp DNA Blood Mini Kit (Qiagen, Hilden, Germany) according to the manufacturer's protocol and stored at −20°C until further analysis.
Genomic DNA from clinical samples was fragmented using 1 U/μl dsDNA Shearase™ Plus (Zymo Research, Irvine, CA, USA) and incubated for 23 min at 42°C to generate 100-500 bp fragments. For adapter-ligated DNA library construction, a TruSeq Nano kit (Illumina, San Diego, CA, USA) with an in-house optimized protocol was used. Low-coverage sequencing (0.3×) was performed on the Illumina NextSeq 500/550 platform (Illumina, San Diego, CA, USA) with a 2×35 paired-end setting using the High Output Sequencing Kit v2.5. Library quantity and quality were measured by fluorometric assay on a Qubit 2.0 (dsDNA HS Assay Kit, Life Technologies, Eugene, Oregon, USA). Fragment analysis was performed on a 2100 Bioanalyzer (High Sensitivity DNA Kit, Agilent Technologies, Waldbronn, Germany). We targeted 5M uniquely mapped reads per sample; however, no analysis was excluded due to lower (or higher) read counts (more info in Supplementary material Table 1).

Theoretical minimal read count estimation
Suppose we model sequencing as a random choice of reads from the whole (mappable) genome. Then we can theoretically deduce the number of uniquely mapped reads needed for a certain accuracy criterion. The random choice of reads from a target region is described by the binomial distribution with mean μ = np and variance σ² = np(1 − p). Here, p is the probability of choosing a read from the target region, and n is the number of reads sequenced. The probability p can furthermore be expressed as the ratio of the region length to the whole-genome length (p = l_c/l_g). When predicting a CNV, we need a certain confidence, traditionally determined by the Z-score, defined as:

Z = (δ − μ)/σ

Here, δ is the number of reads that we observe in the target region. We assume that the number of reads in the target region will be proportional to the number of copies of the region present, i.e., either δ = n(p + p/2) for a duplication or δ = n(p − p/2) for a deletion of the region on a single chromosome. If we solve for n and substitute, we obtain:

n = 4Z²(1 − p)/p
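For completeness, the intermediate algebra behind the last step, restating only the definitions above, is:

\[
Z = \frac{\delta - \mu}{\sigma}
  = \frac{n\left(p \pm \frac{p}{2}\right) - np}{\sqrt{np(1-p)}}
  = \pm\frac{np/2}{\sqrt{np(1-p)}}
\qquad\Longrightarrow\qquad
n = \frac{4Z^{2}(1-p)}{p} = 4Z^{2}\,\frac{l_g - l_c}{l_c}.
\]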

Variant identification
To identify variations, we perform the following pipeline (an illustrative sketch of step 1 follows this list):

1. Mapping and binning
   a. mapping reads using bowtie2 [16]
   b. binning reads into same-size 20 kb bins
   c. normalizing bin counts
2. Normalization (similar to the one published previously [17])
   a. LOESS-based GC correction [18]
   b. PCA normalization to remove higher-order population artifacts on autosomal chromosomes
   c. subtracting the per-bin mean bin count to obtain data normalized around zero
3. Filtration of unusable bins
   a. unmappable or badly mappable regions (zero or low mean bin count)
   b. repetitive regions or regions with systematically increased mappability (high mean bin count)
   c. highly variable regions (high variance of bin count)
4. Segment identification and reporting
   a. circular binary segmentation [19] to identify consistent segments of similar coverage
   b. assigning significance to segments based on the proportion of reads
   c. visualization of findings (Figure 3)

Scripts (Python 3.7) and data are available at https://github.com/marcelTBI/GenomeScreen.
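As an illustration of step 1, a minimal binning sketch follows. It is a hedged sketch, not the actual GenomeScreen code; the use of pysam for BAM access is an assumption (the paper does not specify the implementation), while the mapping-quality cutoff and bin size come from the text:

import numpy as np
import pysam

BIN_SIZE = 20_000  # same-size 20 kb bins, as in the pipeline above

def bin_reads(bam_path: str, chrom: str, chrom_length: int) -> np.ndarray:
    """Count reads with mapping quality >= 40 into 20 kb bins by their start position.
    Requires an indexed BAM file."""
    counts = np.zeros(chrom_length // BIN_SIZE + 1)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom):
            if not read.is_unmapped and read.mapping_quality >= 40:
                counts[read.reference_start // BIN_SIZE] += 1
    return counts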

Mapping and binning
Firstly, the reads are mapped to a reference using Bowtie 2 [16] with the --very-sensitive setting. We use the hg19 reference in all applications, but other references can be used without changes to the algorithm. The reads are then filtered for a mapping quality of at least 40 and binned according to their start positions into same-size 20 kb bins. All subsequent analyses are performed on the bin counts, and the algorithm does not use any other information about the reads (for example, their sequence). For training purposes, the bin counts corresponding to autosomal chromosomes are normalized to the same number of reads for each sample (i.e., each bin is divided so that the sum of all bins on autosomal chromosomes is the same for every sample). Furthermore, the same is done separately for chromosome X and chromosome Y. A consequence of the separate normalization of sex chromosomes is that the current approach can detect only small sex chromosomal variations and not whole sex chromosomal aneuploidies.
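A minimal sketch of this per-sample scaling (the target total and the chromosome naming convention are illustrative assumptions):

import numpy as np

def scale_to_common_total(counts: dict, target: float = 1e6) -> dict:
    """counts maps a chromosome name to its per-bin read counts (numpy arrays).
    Autosomes are scaled by one common factor; chrX and chrY are scaled separately."""
    autosomes = [c for c in counts if c not in ("chrX", "chrY")]
    autosome_total = sum(counts[c].sum() for c in autosomes)
    scaled = {c: counts[c] * target / autosome_total for c in autosomes}
    for sex_chrom in ("chrX", "chrY"):
        if sex_chrom in counts:
            scaled[sex_chrom] = counts[sex_chrom] * target / counts[sex_chrom].sum()
    return scaled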

Normalization
Normalization consists of three steps. Firstly, a sample-wise LOESS-based GC correction is applied to the bin counts [18]. Afterward, principal component analysis (PCA) normalization is used to remove higher-order population artifacts on autosomal chromosomes [17]. To train the PCA, the LOESS-corrected bin counts of 789 NIPT samples with female fetuses were converted to principal component space, and the first 15 principal components were stored. The bin count vector of a new sample is then transformed into the principal component space defined by these first 15 components and transformed back to bin space to obtain residuals, which are then removed from the bin counts. The first principal components represent noise commonly seen in euploid samples, and their removal helps to normalize the data. Currently, the PCA normalization is done only on autosomal chromosomes due to the unavailability of enough male samples for training. In the future, training the PCA on both male and female samples is likely to increase the prediction precision for sex chromosomes. Lastly, we subtract the per-bin mean bin counts to obtain data normalized around zero. This last step is trained on the already PCA-normalized bin counts (where available) and helps compensate for mapping inequality between various genomic regions.
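A hedged sketch of the PCA step (scikit-learn is an assumption, since the paper does not name the library, and the random reference matrix below merely stands in for the 789 LOESS-corrected training samples):

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the LOESS-corrected autosomal bin counts of the 789 training samples.
reference = np.random.default_rng(0).normal(loc=1.0, scale=0.1, size=(789, 1000))

pca = PCA(n_components=15)  # keep the first 15 principal components
pca.fit(reference)

def remove_population_artifacts(sample: np.ndarray) -> np.ndarray:
    """Project a sample onto the first 15 components and subtract the reconstruction."""
    reconstruction = pca.inverse_transform(pca.transform(sample.reshape(1, -1)))
    return sample - reconstruction.ravel()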

Filtration of unusable bins
To further improve accuracy, we filter bins that have an unusual signature: either a low mean (signaling bad mappability of the region), a high mean (repetitive regions or regions with some systematic bias), or a high variance (highly variable regions). Furthermore, the filtered regions were manually curated to reduce the scatter of filtered regions, mainly around centromeres and on sex chromosomes. The filtration leaves out around 15% of the genome, mainly due to low mappability, especially in and around centromeres.
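A minimal sketch of such a filter (all thresholds below are illustrative assumptions; GenomeScreen's actual cutoffs and the manual curation are not reproduced here):

import numpy as np

def usable_bin_mask(bin_matrix: np.ndarray,
                    low: float = 0.5, high: float = 2.0, max_var: float = 2.0) -> np.ndarray:
    """bin_matrix: samples x bins of normalized counts; returns a boolean keep-mask."""
    mean = bin_matrix.mean(axis=0)
    var = bin_matrix.var(axis=0)
    keep = mean >= low * np.median(mean)       # drop badly mappable bins (low mean)
    keep &= mean <= high * np.median(mean)     # drop repetitive/over-mapped bins (high mean)
    keep &= var <= max_var * np.median(var)    # drop highly variable bins
    return keep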

Segment identification and reporting
After normalization and filtering, we have a signal (grey dots in Figure 4) that needs to be segmented into parts of the same level before it can be evaluated. For this purpose, we use the circular binary segmentation (CBS) algorithm implemented in the R package DNAcopy [19]. After segmentation, each segment is assigned a significance level based on its length and its difference from zero. Since we know the mean bin counts, we can estimate the level corresponding to a complete deletion or duplication of one copy of a chromosome (magenta dashed lines in Figure 4). We then distinguish between five color-coded levels of significance (deviation relative to the one-copy level, and minimal length): magenta (at least 75%, at least 200 kb); red (at least 25%, at least 200 kb); orange (at least 25%, at least 40 kb); yellow (at least 12.5%, at least 40 kb); and green (all others, i.e., very short segments or segments around zero). The findings are then reported as a text file for further machine processing, and each chromosome is visualized (Figure 4).
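A sketch of the level assignment (the thresholds are the percentages and lengths from the text above; the function and parameter names are illustrative):

def classify_segment(deviation_fraction: float, length_bp: int) -> str:
    """deviation_fraction: |segment level| relative to a one-copy deletion/duplication."""
    if deviation_fraction >= 0.75 and length_bp >= 200_000:
        return "magenta"
    if deviation_fraction >= 0.25 and length_bp >= 200_000:
        return "red"
    if deviation_fraction >= 0.25 and length_bp >= 40_000:
        return "orange"
    if deviation_fraction >= 0.125 and length_bp >= 40_000:
        return "yellow"
    return "green"  # very short segments or segments around zero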

In-silico analysis
For the in-silico analysis, we chose 83 samples without any aberration and with a read count of at least 10M. Firstly, the samples were downsampled to the studied read count (3M to 10M in steps of 1M). Then, for each of the tested variation lengths (20 kb to 200 kb in steps of 20 kb), 100 random variations on autosomal chromosomes were generated that do not overlap with the filtered regions (see Section 4.3.3). To create a sample with an artificial aberration, the bins corresponding to the generated random variation were multiplied accordingly (thus, the most time-consuming mapping step was performed only once per sample). Afterward, variant identification was performed without changes.
In total, we gradually created 664,000 artificial samples (100 variations × 83 samples × 10 variation lengths × 8 read counts) and performed variant identification on them to analyze the impact of read count and variant length. The results are displayed in Figure 2.
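A hedged sketch of the aberration step follows. The 1.5x/0.5x factors follow the one-copy gain/loss model of Section 4.2 (the paper says only "multiplied accordingly"), and the placement logic is simplified, ignoring the filtered-region check:

import numpy as np

rng = np.random.default_rng(42)

def simulate_cnv(bins: np.ndarray, length_bins: int, duplication: bool = True) -> np.ndarray:
    """Multiply a random run of bins to mimic a one-copy gain (x1.5) or loss (x0.5)."""
    start = rng.integers(0, len(bins) - length_bins)
    factor = 1.5 if duplication else 0.5
    sample = bins.astype(float).copy()
    sample[start:start + length_bins] *= factor
    return sample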