Identifying Potential Regions of Copy Number Variation for Bipolar Disorder

Bipolar disorder is a complex psychiatric disorder with high heritability, but its genetic determinants are still largely unknown. Copy number variation (CNV) is one of the sources to explain part of the heritability. However, it is a challenge to estimate discrete values of the copy numbers using continuous signals calling from a set of markers, and to simultaneously perform association testing between CNVs and phenotypic outcomes. The goal of the present study is to perform a series of data filtering and analysis procedures using a DNA pooling strategy to identify potential CNV regions that are related to bipolar disorder. A total of 200 normal controls and 200 clinically diagnosed bipolar patients were recruited in this study, and were randomly divided into eight control and eight case pools. Genome-wide genotyping was employed using Illumina Human Omni1-Quad array with approximately one million markers for CNV calling. We aimed at setting a series of criteria to filter out the signal noise of marker data and to reduce the chance of false-positive findings for CNV regions. We first defined CNV regions for each pool. Potential CNV regions were reported based on the different patterns of CNV status between cases and controls. Genes that were mapped into the potential CNV regions were examined with association testing, Gene Ontology enrichment analysis, and checked with existing literature for their associations with bipolar disorder. We reported several CNV regions that are related to bipolar disorder. Two CNV regions on chromosome 11 and 22 showed significant signal differences between cases and controls (p < 0.05). Another five CNV regions on chromosome 6, 9, and 19 were overlapped with results in previous CNV studies. Experimental validation of two CNV regions lent some support to our reported findings. Further experimental and replication studies could be designed for these selected regions.

number status in DNA pooling data using Affymetrix SNP arrays [18]. Following experimental validation, the authors suggest that applying DNA pooling could help to discover more common CNV regions. However, this algorithm deals only with Affymetrix array-data. For genome-wide CNV array-data from other platforms, there is a need to develop more general filtering procedures to reduce noise and perform data analysis whilst using a DNA pooling strategy. We believe that by minimizing potential errors in CNV calling, the chance for correctly evaluating the relationships between CNVs and the trait of interest would be substantially increased. To our best knowledge, there are no genome-wide CNV studies that are conducted for BPD in Asian populations yet. The goal of the present study is to develop a series of filtering and data analysis procedures to identify potential CNV regions for BPD in a Han Chinese population using a DNA pooling strategy.

Subjects, DNA Pooling Construction and Genotyping
We conducted a family study of mood disorders in Taiwan from 2008-2012. Recruitment and clinical characteristics of the participants are described in more detail elsewhere [19,20]. In brief, patients aged between 18 and 70 years and diagnosed with major depression disorder (MDD), bipolar I disorder (BPD-I), or bipolar II disorder (BPD-II) according to the DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, fourth edition), were consecutively referred by psychiatrists in Taiwan. Independent healthy controls were recruited by methods of sending leaflets or "word of mouth" in the community. All of the controls were screened for mood disturbances and other major psychotic disorders. For every participant, we asked questions about ethnicity; only participants whose parents and grandparents are all Han Chinese were enrolled. The study and data collection procedures were approved by the Institutional Review Broad of all participating institutes and hospitals. All participants provided written informed consent after details of the study were fully illustrated.
Blood samples were taken to extract DNA for each individual. A total of 200 independent BPD-I patients and 200 healthy controls were selected with good quality DNA, and we randomly divided them into eight case and eight control groups (each with 25 subjects). DNA concentration and quality were twice checked using Quant-iT™ PicoGreen ® dsDNA Reagent and Kits (Invitrogen, Carlsbad, CA, USA). Equivalent amounts of DNA from each subject were mixed together to create eight case pools and eight control pools. For details of the experimental procedures and DNA pooling strategies please see elsewhere [21]. Whole-genome pooling genotyping was performed using Illumina HumanOmni1-Quad array with approximately one million markers including SNP and CNV probes. Figure 1 shows the flow chart of our CNV analysis. A series of quality control procedures were implemented to improve data quality before running CNV analysis. We first removed markers with missing signals in any pool or on the sex chromosomes. Markers with a genetic control (GC) score equal to zero in more than three pools were excluded. We also removed markers with a median Log R ratio >1 or <−5 across 16 pools as those outside of this threshold are likely to be false or prone to genotyping errors. The Log R ratio represents the normalized intensity of probe signals. After employing these quality control procedures, 694,475 markers were retained. We used PennCNV [9] for CNV analysis. PennCNV is a HMM-based algorithm for CNV calling, which uses normalized intensity (Log R ratio) and allele frequency data (B allele frequency) of markers to simultaneously estimate CNV region and its copy number status [9]. It could analyze array signal data from both Affymetrix and Illumina platforms. We applied this algorithm to call CNV numbers for each pool. If the copy number is equal to two, the Log R ratio is approximately zero. If the copy number is greater (a gain CNV status) or less (a loss CNV status) than two, the corresponding Log R ratio is higher or lower than zero, respectively. The estimated CNV regions for each pool were then identified using PennCNV. We set a series of criteria to obtain informative CNV regions. First, regions with less than 20 markers were filtered out to avoid false-positive results in short regions with PennCNV analysis. A large number of gain CNV regions were predicted from 16 pools. To further reduce the likelihood of obtaining false positive findings in the gain CNV regions, we applied other criteria for the gain CNV regions by Log R ratio to increase data quality; (1) If the mean Log R ratio within the identified CNV region is less than 0.02, we filtered out this region as the intensity around zero indicating a high potential to be a normal CNV status (copy number = 2); (2) If the standard deviation of Log R ratio within the identified CNV region is greater than 0.2, we filtered out this region. The second procedure is also suggested by PennCNV for the quality control of individual samples [9].

Quality Control and Filtering Procedures for CNV Analysis
Because PennCNV tends to split large CNV regions into multiple smaller regions, a merge procedure for adjacent CNV calls was applied in the next step [9]. We performed a gap cleaning procedure to merge neighboring CNVs where the ratio of the gap length and sum of the two neighboring CNV lengths is less than 0.2. At this step, we had obtained around three thousand estimated CNV regions with length ranging between 1.49 kb and 1021.08 kb in the 16 pools. To make comparisons of CNV results possible across pools, we took the union of each defined CNV region for all the pools. In total, there were 2243 unique CNV regions in case pools and 2426 unique CNV regions in control pools. In addition, different CNV calling algorithm could result in different calling results; therefore, we also used QuantiSNP [10] to analyze potential CNV regions identified by PennCNV.
For small sample size, several strategies were implemented to conduct association testing. In the present study, we first constructed the Han Chinese CNV map. We used the published CHB (Han Chinese in Beijing) CNV regions in HapMap and 2 CNV databases from Lin et al. [22] and Lou et al. [23]. We selected more informative CNVs according to the different CNV patterns between case and control pools. The informative CNVs were defined based on the following criteria.
(1) the CNV regions were only found in case pools but not in the Han Chinese CNV map; these CNVs were defined as important regions in cases; (2) the CNV regions were only found in control pools and also reported in the Han Chinese CNV map; these CNVs were defined as important regions in controls; (3) the CNV regions were shown in both case and control pools, but the frequency difference in the two groups is large (>3); these CNVs were defined as enriched in cases or controls; (4) the CNV regions were found in both case and control pools, however the CNV status (gain/loss) was different in the two groups.
These selected CNV regions were potential targets for BPD and were included in the following analyses.

Association Testing for CNV Regions with BPD
We first conducted CNV burden analysis between case and control pools, which is employed in previous studies [24]. We conducted burden analyses stratified by CNV types (gain or loss) and size (length ≥100 kb or ≥500 kb). Secondly, using a data integration framework, we had previously built a candidate gene database for BPD, and obtained 164 prioritized susceptible loci, namely BPDgenes [25]. We mapped genes for the potential CNV regions and compared them to the BPDgenes. For the mapped genes in the CNV regions that overlapped with BPDgenes, we tested the signal differences between case and control pools. We calculated the median signal of the markers within the defined CNV regions. Both t-test and Wilcoxon test were used to evaluate signal associations between cases and controls. Thirdly, a functional enrichment analysis for the mapped genes in selected CNV regions was performed using WEB-based GEne SeT AnaLysis Toolkit (WebGestalt) [26]. We adopted multiple testing corrections using the Benjamini-Hochberg method for each analyzed pathway with the significance level p < 0.05. Finally, we searched the literature to identify previously reported CNV regions in BPD. For associated CNV regions that were found in more than one patient in previous studies, we compared them with our potential CNV regions. These results are summarized in Appendix Table A1.

CNV Validation by RT-qPCR
Due to the genome-wide scale of the CNV identification, we were only able to conduct experimental validation for a few CNV regions for the proof of principle of our filtering procedures and data analysis results. Two CNV regions were selected for validation using real-time quantitative polymerase chain reaction (RT-qPCR). The first region was the 6q27 CNV (Results Section) that replicated results from two previous studies, the second was a CNV region on chromosome 3p14.2 (Results Section) showing signal differences between CNV carriers and non-carriers, including a BPD candidate gene PTPRG. We performed RT-qPCR using Taqman Copy number assays (Chr.3 Hs04761773_cn and Chr. 6 Hs03602538_cn) and Taqman Copy number reference assays (Applied Biosystems, Foster City, CA, USA). Individuals in the carrier case-pool and non-carrier control-pool (25 cases and 25 controls) were tested in each assay, and the RT-qPCR was carried out in triplicate. Sequence Detection Software (SDS) was used for exporting the threshold cycle (Ct) data and further analyzing differences in Ct values (ΔCt) between the test locus and the control locus. Copy number variation was analyzed with the CopyCaller software [27]. We used Student's t-test to compare raw copy number signals calculated from ∆Ct values to determine the statistical significance of predicted copy-number differences in cases and controls. The significant threshold was defined by p < 0.05.

Results
Results of the CNV burden analysis for the potential CNV regions are displayed in Table 1. There was a relatively higher proportion of loss CNV regions (≥100 kb) in BPD patients than in controls, though the difference did not reach statistical significance (Wilcoxon p-value = 0.105). There was no significant difference in other types of the CNVs between BPD cases and controls. Table 2 shows the results of gene mapping in the selected CNV regions. There were 882 CNV regions that were only found in case pools and not mapped to the Han Chinese CNV map (Important regions in cases). In contrast, 94 CNV regions were only found in control pools and were mapped to the Han Chinese CNV map (Important regions in controls). In addition, two CNV regions were enriched in cases for which the frequency of case pools having this CNV is three more than that in control pools, and 26 CNV regions were enriched in controls. Only one CNV region had a different CNV status, with loss status in cases and gain status in controls. In total, 1247 genes were mapped to these selected CNV regions, and 30 of them were overlapped with the prioritized genes in the BPDgenes [25]. We compared the CNV signal differences between cases and controls and focus on the regions that had affected genes mapped to the BPDgenes (i.e., 30 CNV regions). We presented the CNV regions that exhibited signal differences between cases and controls with p-value <0.2 using t-test or Wilcoxon test, with these CNVs regarded as the top priority regions for BPD in our samples (see Table 3). Two of the CNV regions showed a significant difference (at p < 0.05) using either a t-test or Wilcoxon test. This included CNV regions on chromosomes 11 and 22.
Functional enrichment analysis was performed for the 1247 CNV genes to explore their biological information. These mapped genes were significantly enriched (adjusted p-value less than 0.05) in 21 GO terms (see Table 4). In addition, the enrichment analysis was conducted using the database of disease associated genes in WebGestalt. Two disease categories (bipolar disorder and mood disorder) reached statistical significance (p < 0.05) (Appendix Table A2). Lastly, we compared our CNV results to findings in the previous studies of BPD [7,24,[28][29][30][31][32][33]. We found that five of the 1005 selected CNV regions (6q16.3, 6q27, 9q34.3, and two regions on 19p12) were also identified in previous studies (see Table 5). All of the overlapped regions were only found in cases. Four were gain CNV regions, and one region on 19p12 was loss CNV status. Priebe et al. found a CNV region at 6q27 that was overrepresented in bipolar patients with age at onset ≤21 [28]. This region was also reported to be enriched in affected members of a three-generation Amish pedigree of European descent [29]. Two CNV regions on chromosome 19 were both located at 19p12. Grozeva et al. found that these CNV regions were associated with BPD in Wellcome Trust Case Control Consortium samples [30]. In addition, Bergen et al. reported a duplication CNV region (9q34.3) in 14 patients with BPD, 17 patients with schizophrenia, and 11 normal controls [31]. Association testing of this region was significant when combining the two patient groups. Moreover, McQuillin et al. reported CNV regions at 6q16.3 in 2 BPD cases in British samples [32]. The rest of the 1000 CNV regions were not reported in more than one individual in previous studies.  In experimental validation of CNV at chromosome 3p14.2, two BPD individuals and one control subject out of 25 cases and 25 controls were found to have gain CNV at PRPRG region. The signal intensity of copy number value in the gain CNV that compares with a CNV status of 2 was 2.52 ± 0.002 (mean ± SD) versus 1.98 ± 0.033 (mean ± SD). The difference reached statistical significance (p < 0.05) using t test. Thus, confirmatory RT-qPCR experiments lent further support for this CNV validation. For the chromosome 6q27 region, this region was overlapped with CNV findings reported in previous BPD studies but did not show signal difference between our cases and controls. Experimental results showed that no individual (out of 25 cases and 25 controls) was validated to have gain CNV. The signal intensity of copy number value in cases and controls was 1.99 ± 0.03 and 1.92 ± 0.05, respectively.

Discussion
The current study applied a series of filtering and data analysis procedures to identify CNV regions that are related to bipolar disorder using a DNA pooling strategy. Several filtering methods have previously been developed for array data to increase detection power, in particular for microarray gene expression data. A screening threshold is usually set based on the variance of expression signals, where probes with low variance are excluded as non-informative markers [34,35]. The filtering procedures become even more important while adopting a DNA pooling strategy. In a genome-wide association study using pooled DNA, SNP quality control filters are set based on the indicators calculated from pooled intensity [36]. Similar concepts are adopted in our filtering scheme for pooling CNV data. The criteria we set for a CNV analysis with small sample size could assist for CNV identification by reducing the potential impact of experimental noise to explore the relationships between CNVs and the trait of interest. A number of potential CNV regions are reported for BPD in our Han Chinese samples that may warrant further investigation.
Higher CNV burden is often observed in patients with psychiatric disorders when compared with healthy controls [37]. However, the reported specific CNV regions, even in large-scale Caucasian samples, are rarely replicated [24,28,30,32]. In the present study, we found that the burden of loss status CNVs in BPD cases is higher than in controls, though the comparisons did not reach statistical significance. Similar findings are reported in Zhang et al., which conducted a genome-wide CNV study of BPD in European Americans. They found that the number of singleton deletion CNVs in BPD cases is significantly higher than those in controls (p = 0.007) [38]. Other studies reported fewer CNVs with loss status in BPD cases than controls. For instance, in a young adult British sample, McQuillin et al. found that BPD subjects have significantly fewer deletion CNVs, with the size ranging from 200-500 kb compared to controls (p = 0.039), while fewer singleton duplication CNVs with the size over 100 kb are also found in BPD cases (p = 0.03) [32]. In addition, large (≥500 kb) inherited duplication CNVs are also found to be enriched in familial BPD cases (p = 0.03) [24]. The distinct findings of excessive deletion or duplication CNVs among BPD patients in the previous studies may result from the differences in sample populations, clinical characteristics of BPD cases, the CNV detection platforms, and CNV analysis criteria. Some studies advocate to subgroup BPD patients to obtain genetically more homogeneous groups. Age at onset is an often considered feature. Two CNV studies divided BPD cases into early or late onset subgroups by the age of onset (AO) of BPD diagnosis [24,28]. Comparing with healthy controls, one study reported that the rate of de novo CNVs is significantly higher in the patients group with AO ≤ 18 [24], whilst the other reported a higher frequency of microduplication CNVs in patients with AO ≤ 21 [28]. Two regions with duplication CNVs-the 6q27 and 10q11 CNV regions-are especially noted for early onset BPD [28]. To stratify BPD patients into subgroups based on relevant clinical characteristics could be considered in future CNV studies to reduce heterogeneity among patients.
Through a series of data analysis procedures and a comprehensive literature search, we identified several CNV regions in relation to BPD. Some of the regions are reported in previous CNV studies, and some are novel regions. Novel CNV regions may be ethnic group specific and provide additional clues for exploring the pathogenesis of BPD in Han Chinese population. At the first stage of data screening, we found approximately 1000 novel CNV regions. Using a gene prioritization framework, we had previously built a gene database for BPD. The top list in the BPDgenes has a higher combined score, and thus higher confidence to be associated with bipolar illness. There are 30 genes in our identified CNV regions that are mapped to the BPDgenes (see Table 2), and the CNV regions that encompass these genes are considered high priority for further association testing. We reported signal differences between cases and controls for CNV regions on several chromosomes (see Table 3). These CNV regions have a higher potential to be related with BPD, and genes mapped to these regions are candidate genes for BPD. For instance, ARNTL is a circadian gene and has been found to be associated with BPD in Caucasian samples [39,40]. Gene PTPRG has previously been found to be associated with schizoaffective disorder [41]. We conducted RT-qPCR experimental validation for the PTPRG gene region, and the gain CNV status was validated, for which the signal intensity was higher in BPD cases than in controls. Research on the functional properties of these affected genes in potential CNV regions and how they link to the etiology of BPD may help point to a direction for the development of a new drug target.
In addition to novel regions, there are five CNV regions (located at 6q16.3, 6q27, 9q34.3, 19p12) overlapped with the results from previous studies (see Table 5). One CNV region, 6q16.3, is reported to be associated with BPD in both ours and the study conducted by McQuillin et al. [32]. Gene GRIK2 is mapped to this CNV region. This gene is essential for brain development [42]. Previously, polymorphisms in GRIK2 gene have been reported to exhibit associations with obsessive-compulsive disorder [43] and autism spectrum disorders [44]. A loss CNV was found in our cases in 19p12 while a gain CNV in the same region was reported in Grozeva et al. [30]. Olsen et al. conducted a meta-analysis for three CNV regions-6q27 and 19p12 (two CNVs)-that are overrepresented in patients with affective disorder in three case-control studies [45]. However, the association testing is not significant for the three CNVs. As very few individuals possess either of these CNVs, a reliable test is not easy to perform for testing CNV associations with disease outcomes. Further replicated studies with larger sample size are needed to verify the relationship between candidate CNVs and BPD.
The heterogeneity of genetic architecture across populations often leads to diverse genetic findings on the phenotypic outcomes of interest [46,47]. The diversity of CNVs in different ethnic groups has also been noted previously [48,49]. Among CNV studies in BPD, to our best knowledge, we are the first to conduct a genome-wide level of CNV analysis in an Asian population. For identified potential CNV regions for BPD, we also compared our results with findings from previous studies in different samples. The CNV regions that reported consistently in ours and previous studies may represent common risk regions across populations for BPD. If validated by experiments, novel CNV regions that were only reported in our study may indicate population specific genetic components for BPD, such as the CNV region on chromosome 3p14.2.
By conducting functional analysis using GO terms for our mapped CNV genes, we found that the top three enriched pathways were involved in biological adhesion, cell adhesion, and neuron projection (Table 4). Several other genetic association studies, but not CNV studies, have also performed pathway analysis for BPD related candidate genes. The top significant GO pathways reported in Chang et al., are amine binding, synapse transmission, and transmission of nerve impulse [50]. Another study applied pathway analysis while incorporating information of allele-specific gene methylation [51]. They reported enriched pathways for extracellular matrix in brain, gated ion channel, and neurotransmitter receptor related pathways. Their findings support the involvement of biological functions of cell adhesion and neuronal transmission underlying bipolar illness. Further studies to investigate the interaction and networks among identified molecules for BPD could be conducted to understand the pathophysiology of BPD.
There are several limitations in the present study. First, DNA pooling strategy is restricted to the original study design (i.e., for our study, bipolar disorder vs. control) and not flexible for conducting secondary data analysis. If there is a belief of true genetic heterogeneity in disease subtypes or the genetic factors are influenced by other covariates, it is not possible to adjust results for these concerns. Second, employing a series of data filtering steps may cause false-negative findings for certain CNV regions. In addition, because there is no consensus of a standard method for CNV calling, we used a second calling algorithm for our identified CNV regions. In the 12 reported CNV regions (listed Tables 3  and 5), only the loss CNV region at chr19: 24193894:24282139 were consistently called by both calling algorithms. It is consistent with prior study showing that both algorithms have high reproducibility rates in loss CNV regions, but low rates in gain CNV regions [52]. In future study, applying multiple CNV calling algorithms and conducting experimental validation are desired. Third, due to having a small sample size we could only identify relatively common CNVs. Very rare or de novo CNVs are likely to be ignored. Nevertheless, we conducted power calculation [53] using median Log R ratio within a CNV region. We took one CNV at chr11: 13,224,130-13,256,233 as an example (see Table 3). The power for detection of signal difference between cases and controls can reach 0.87 in the current study. Follow-up individual studies with larger sample sizes should be designed to validate and test associations for identified CNV regions. Fourth, CNV regions that do not have any mapped genes (i.e., in gene desert regions) are not reported as the potential roles of these CNVs are not clear. Lastly, the experimental quality to detect CNV signals is a concern for pooling based design as there are no easy indicators to estimate the accuracy of CNV signal intensities. Other than two selected CNVs for experimental validation, we do not conduct a genome-wide level of validation and independent replication studies for our identified CNV regions. Further large-scale individual and replication studies are needed to investigate the roles of these CNVs and eventually provide clues for the underlying mechanisms of bipolar illness.

Conclusions
There are many difficulties faced in performing CNV studies as most of the disease associated CNVs are complex, rare and usually with marginal effect size. The heterogeneity of bipolar disorder brings another challenge in explaining the CNV results. Proper data filtering and analysis strategies are recommended in exploring the relationships between CNVs and the trait of interest. We conducted the first pilot study of CNV association with BPD in Han Chinese population and identified several potential CNV regions for BPD. It is important to design further validation experiments and perform basic research for these CNVs to reveal their biological roles and explain their involvement in bipolar illness.
CNV data and results interpretation. We also thank S.J. Lin who helped to conduct experimental validation and A. Woolston who assisted for English editing. We especially thank all participants who agreed to join this study.

Authors Contributions
Yi-Hsuan Chen and Po-Hsiu Kuo conceived and designed the experiments. Ru-Band Lu assisted for data collection. Yi-Hsuan Chen carried out the data analysis. Hung Hung and Po-Hsiu Kuo assisted with interpretation of the data. Yi-Hsuan Chen and Po-Hsiu Kuo drafted the manuscript. All authors read and approved the final manuscript.

Conflicts of Interest
The authors declare no conflict of interest.