Identification of Copy Number Aberrations in Breast Cancer Subtypes Using Persistence Topology

DNA copy number aberrations (CNAs) are of biological and medical interest because they help identify regulatory mechanisms underlying tumor initiation and evolution. Identification of tumor-driving CNAs (driver CNAs) however remains a challenging task, because they are frequently hidden by CNAs that are the product of random events that take place during tumor evolution. Experimental detection of CNAs is commonly accomplished through array comparative genomic hybridization (aCGH) assays followed by supervised and/or unsupervised statistical methods that combine the segmented profiles of all patients to identify driver CNAs. Here, we extend a previously-presented supervised algorithm for the identification of CNAs that is based on a topological representation of the data. Our method associates a two-dimensional (2D) point cloud with each aCGH profile and generates a sequence of simplicial complexes, mathematical objects that generalize the concept of a graph. This representation of the data permits segmenting the data at different resolutions and identifying CNAs by interrogating the topological properties of these simplicial complexes. We tested our approach on a published dataset with the goal of identifying specific breast cancer CNAs associated with specific molecular subtypes. Identification of CNAs associated with each subtype was performed by analyzing each subtype separately from the others and by taking the rest of the subtypes as the control. Our results found a new amplification in 11q at the location of the progesterone receptor in the Luminal A subtype. Aberrations in the Luminal B subtype were found only upon removal of the basal-like subtype from the control set. Under those conditions, all regions found in the original publication, except for 17q, were confirmed; all aberrations, except those in chromosome arms 8q and 12q were confirmed in the basal-like subtype. These two chromosome arms, however, were detected only upon removal of three patients with exceedingly large copy number values. More importantly, we detected 10 and 21 additional regions in the Luminal B and basal-like subtypes, respectively. Most of the additional regions were either validated on an independent dataset and/or using GISTIC. Furthermore, we found three new CNAs in the basal-like subtype: a combination of gains and losses in 1p, a gain in 2p and a loss in 14q. Based on these results, we suggest that topological approaches that incorporate multiresolution analyses and that interrogate topological properties of the data can help in the identification of copy number changes in cancer.

a sequence of simplicial complexes, mathematical objects that generalize the concept of a graph. This representation of the data permits segmenting the data at different resolutions and identifying CNAs by interrogating the topological properties of these simplicial complexes. We tested our approach on a published dataset with the goal of identifying specific breast cancer CNAs associated with specific molecular subtypes. Identification of CNAs associated with each subtype was performed by analyzing each subtype separately from the others and by taking the rest of the subtypes as the control. Our results found a new amplification in 11q at the location of the progesterone receptor in the Luminal A subtype. Aberrations in the Luminal B subtype were found only upon removal of the basal-like subtype from the control set. Under those conditions, all regions found in the original publication, except for 17q, were confirmed; all aberrations, except those in chromosome arms 8q and 12q were confirmed in the basal-like subtype. These two chromosome arms, however, were detected only upon removal of three patients with exceedingly large copy number values. More importantly, we detected 10 and 21 additional regions in the Luminal B and basal-like subtypes, respectively. Most of the additional regions were either validated on an independent dataset and/or using GISTIC. Furthermore, we found three new CNAs in the basal-like subtype: a combination of gains and losses in 1p, a gain in 2p and a loss in 14q. Based on these results, we suggest that topological approaches that incorporate multiresolution analyses and that interrogate topological properties of the data can help in the identification of copy number changes in cancer.
Keywords: breast cancer subtypes; copy number aberrations; topological data analysis; TAaCGH

Introduction
Chromosome aberrations are large-scale structural changes of the genome that are commonly associated with cancer initiation and progression [1][2][3]. DNA copy number aberrations (CNAs), such as copy number gains and losses, are of particular interest, because they may respectively harbor oncogenes and tumor suppressor genes; hence, they have the potential to directly regulate cellular growth pathways (reviewed in [4][5][6]). CNAs that contain oncogenes or tumor suppressor genes are commonly known as driver aberrations; those that do not have functional implications are termed passenger aberrations. Genome-wide experimental detection of CNAs is achieved through microarray and/or DNA sequencing technologies [7][8][9][10][11][12]. Identification of driver CNAs, however, still remains a challenge [13][14][15][16][17][18][19]. One approach to identify such aberrations is through statistical supervised methods [19][20][21], which detect CNAs that are common and specific to a given cancer subtype or a cancer with specific clinical characteristics. Here, we propose a supervised method that identifies CNAs based on the topological properties of the aCGH profile. We call this method topological analysis of aCGH (TAaCGH).
TAaCGH associates a point cloud with each aCGH profile by means of a sliding window map [22,23] and uses the topological properties of the point cloud, obtained by standard techniques of persistence homology (reviewed in [24,25]), to identify regions of amplifications and deletions. Two properties differentiate our approach from other commonly-used supervised methods. First, TAaCGH performs a multiresolution segmentation of the data, similar to that of wavelets [26], and second, TAaCGH interrogates the topological properties of the data, rather than each of the independent clones or segmented regions of each patient profile.
In this study, we tested our approach by identifying CNAs that are specific to the molecular subtypes of breast cancer ( [27,28], also reviewed in [29,30]), since it is known that different subtypes have different regulatory mechanisms and, in some cases, well-determined patterns of driver CNAs [31][32][33][34][35][36]. Several studies have also reported the association between CNAs and the evolution of the tumor or the response to treatment [7,[37][38][39][40][41][42][43][44][45]. Therefore, an additional important aspect of CNA studies is the possibility to identify prognostic subgroups with different outcomes and/or responses to treatment within each gene expression subtype. We analyzed the data reported in [33] where CNAs associated with molecular subtypes Luminal A, Luminal B, ERBB2/HER2/NEU (denoted by HER2+) and basal-like were identified using the supervised algorithm called Supervised Identification of Regions of Aberration in aCGH (SIRAC) [21]. TAaCGH found all regions reported in the original publication for the Luminal A and HER2 subtypes and a new amplification at the location of the progesterone receptor gene (11q) in the Luminal A subtype. In the basal-like subtype, TAaCGH found all aberrations reported in [33], except 8q and 12q; these two CNAs were found upon removal of three patients that had exceedingly large copy number changes. Interestingly, TAaCGH also found 21 additional regions in the basal-like subtype, including a combination of copy number gains and losses in 1p, a gain in 2p and a loss in 14q. The Luminal B subtype only revealed specific CNAs when the basal-like subtype was removed from the control set. Under those conditions, TAaCGH found all CNAs reported in [33], except 17q and 10 new aberrations. Most of these newly-identified regions have been reported in other independent studies and were validated using an independent dataset [32] and/or using GISTIC [13]. We therefore suggest that the use of topological data analysis can help identify new aberrations in cancer.

Simulation Data
Each simulated dataset consisted of 120 aCGH profiles, with 100 clones each. The 120 profiles were equally split between the test and control sets. The implementation of the simulated profiles followed the work of [21,46], where the copy number value of each clone was drawn from a Gaussian distribution of mean µ = 0 for clones inside an aberration and of mean µ = 0 for clones outside any aberration. The standard deviation σ was constant for all clones in any given simulation. The mean value of an aberration was µ ∈ {−1, 0.6, 1}, the standard deviation of an aberration σ ∈ {0.2, 0.5} and the length of an aberration (i.e., number of clones) λ ∈ {2, 3, 5, 10, 20, 50, 75}. Simulations were repeated multiple times for each combination of parameters.

The Horlings Dataset
The dataset analyzed in this study was published by Horlings and colleagues [33]. Measurements of copy number changes were performed on microarrays containing 3.5 k Bacterial Artificial Chromosome (BAC) , P1-Derived Artificial Chromosome (PAC) DNA segments covering the entire genome with an average spacing of 1 Mb. Each BAC clone was spotted in triplicate on every slide (Code Link Activated Slides, Amersham Biosciences). Signal intensity measurements were captured using ImaGene Software (BioDiscovery, Inc.) and normalized by median print tip normalization. Intensity ratios (Cy5/Cy3) were log-transformed, and triplicate spot measurements were averaged. From a pool of 295 breast tumor specimens, 68 samples were selected to represent the most common molecular subtypes: Luminal A (n = 21) and Luminal B (n = 12), basal-like (n = 21), HER2-enriched, also known as ERBB2/HER2/NEU, and denoted by (HER2+) (n = 14). All samples contained 50% or more tumor cells. The raw data were not imputed; clone positions were outdated and, in some instances had, different clones associated with the same genomic position. Therefore, some preprocessing was required.
We found that the position of the clones reported in [33] did not match those in publicly-available databases. For instance, the position of the clone RP11 to 94L15n, which contains ERBB2, was reported to be at 35,065,321 bp on chromosome 17q in [33], but mapped to base pair position 37,812,853 in the ENSEMBL database. To address this issue, we remapped all clones according to the ENSEMBL database (built GRCch37). We found most clones located near or at the reported position; however, of the original 3277 clones in the Horlings study, we updated the position of 3021 clones and removed 256. We removed 122 clones that had no base pair information in the original publication or ENSEMBL. Ninety eight clones were in a chromosome different from that reported in the original publication, and eight clones were in the correct chromosome, but at a position located more than 5 × 10 6 bps away from the position reported in ENSEMBL. Finally, we removed 28 clones that were in the correct chromosomes, but had inconsistent relative positions with respect to their immediate neighboring clones. We imputed missing values using the algorithm called locally weighted scatterplot smoothing (lowess) [47]. Entries of clones that were mapped to the same locations were averaged.

Detection of Focal Copy Number Aberrations Using TAaCGH
Here, we extend the method initially proposed in [22,48] to analyze microarray data (see the Conclusions Section for a detailed explanation of the new features reported in this work). For a chosen section of m copy number values, TAaCGH associates a point cloud in an euclidean n-dimensional coordinate system (i.e., R n ), 1 < n < m. We illustrate this association by building a point cloud in R 3 from a section of copy number values {y 1 , y 2 , ...y m } (see Figure 1). Any three consecutive copy number values {y i , y i+1 , y i+2 } naturally define a point in R 3 with coordinates (y i , y i+1 , y i+2 ) (i.e., the first log ratio value, y i , is assigned to the x coordinate of the point in the point cloud; the second log ratio value, y i+1 , is assigned to the y coordinate of the point; and the third log ratio value, y i+2 , to the z coordinate of the point). This algorithm is well defined everywhere, except when i = m − 1, m (i.e., the last two copy number values of the section), since the third coordinate of the point in the cloud (in R 3 ) is not defined when i = m − 1, and neither the second nor the third coordinates are defined when i = m. To solve this problem, TAaCGH completes the missing entries by considering the first values of the section (i.e., {y 1 , y 2 }). In order to represent the entire section of copy number values as a single point cloud, TAaCGH uses a sliding window approach. Therefore, the point cloud generated by a section of consecutive copy number measurements {y 1 , y 2 , ...y m } is {(y 1 , y 2 , y 3 ), (y 2 , y 3 , y 4 ), (y 3 , y 4 , y 5 ), ...(y n−1 , y n , y 1 ), (y n , y 1 , y 2 )}. In Figure 1A, an idealized profile with gains (green), no changes (silver) and losses (blue) is shown. The associated point cloud in R 3 is shown in Figure 1B with the points connected by edges (see below for an explanation of the meaning of the edges). A number of features can be noticed when representing the data as a point cloud. First, the associated point cloud has an elliptical shape, because consecutive copy number values are correlated. In fact, when TAaCGH was applied to gene expression profiles, we observed that the associated clouds were spherical due to the lack of correlation between expression values of consecutive genes along the genome [48]. Second, consecutive gains are mapped to the octant with all positive values, consecutive losses to the octant with all negative values, and values containing combinations of positive and negative values are mapped to the other octants. Third, the higher the absolute value of the gain or loss, the further the corresponding points in the point cloud will be from the origin. Consequently, the noise in the data is mapped near the origin of the coordinates. Hypothetical copy number changes are colored in blue (losses) and green (gains). Non-significant changes are colored in silver. (B) Point cloud, with points connected by edges, associated with the aCGH profile using a sliding window approach The next step in the algorithm is to build a filtration of Vietoris-Rips simplicial complexes. The goal of this step is to build a segmented picture of the data from which topological properties can be inferred. Intuitively, a Vietoris-Rips simplicial complex is a generalization of a graph that is built as follows: for a point cloud in R n and a fixed small number (called the filtration coefficient), one defines an edge between two points in the cloud if the euclidean distance between the two points is less than or equal to . If n ≥ 2, then solid triangles are also part of the Vietoris-Rips simplicial complex, and a solid triangle between three points is included in the complex if the three points are connected by edges. This process is also valid in higher dimensions and is generalized by adding tetrahedra and higher dimensional minimal convex sets. It is evident that for any two values 1 < 2 , the associated simplicial complexes S 1 , S 2 satisfy S 1 ⊂ S 2 . Therefore, if one lets the filtration coefficient systematically increase 1 < 2 < 3 < ... < p , one obtains a filtration of simplicial complexes S 1 ⊂ S 2 ⊂ S 3 ⊂ ... ⊂ S p (see [24,25] for a detailed description). We propose that in the analysis of aCGH data, the associated filtration can be viewed as a continuous segmentation process that assigns the same copy number value to clones whose copy number value difference is less than . In other words, when two points in the point cloud connect, they become part of the same element in the simplicial complex. This identification of points in the point cloud can be interpreted as a segmentation step, where the clones generating the points in the point cloud are assigned the same copy number value.
The key property of this representation of the data is that it allows us to perform association studies between the phenotype of interest and the topological properties of the simplicial complexes. In this work, we have done so for the number of connected components (called the zeroth Betti number and denoted by β 0 ), a topological property that measures the number of detached subsets that make up a dataset. We chose to start with the number of connected components, because, as proposed in [22] and illustrated in Figure 1, the value of β 0 helps identify CNAs. Calculations of β 0 were done using the software jPlex [49].
By considering the values of β 0 across the filtration, one obtains a function β 0 ( ) that relates the number of connected components to each distance . Furthermore, given two sets of patients (test and control), it is natural to compute the average value of β 0 for each value of for the control and test set separately and to associate a p-value with the difference between the two measurements. TAaCGH identifies significant differences in β 0 with differences in copy number values between the two populations. Figure 2 presents the algorithm for detecting sections of CNAs using TAaCGH. (A) shows the ideogram of a chromosome in which the section to be analyzed has been highlighted in green. (B) shows the aCGH profiles for one of the sections for two patients taken from [7]. The profile on the left-hand side has an amplification, while the one on the right has no copy number changes. The sliding window approach described above will create a point cloud of copy number values in R n for each profile. Figure 2C shows the corresponding point clouds and the one-dimensional Vietoris-Rips simplicial complexes for the two profiles (for n = 2). The example on the left, corresponding to the profile with the copy number gain, clearly shows two large connected components. One component is located at the origin (red) and accounts for all clones with small copy number values. The second component, away from the origin (in yellow), contains the copy number values corresponding to the amplification. The point cloud on the right shows only one connected component at the origin. Hence, in TAaCGH, each patient is not only represented by his or her associated point cloud, but by his or her corresponding filtration, from which the topological invariants can be calculated for each value of .   The ideogram of chromosome 1 and the overlapping sections to be analyzed (in green). One section (black) is selected and analyzed as illustrated in the following panels. (B) Array CGH profiles of the chromosome section for two patients (test and control) taken from [7]. (C) aCGH data are mapped using a sliding window algorithm to the Euclidean two-dimensional coordinate system, R 2 , and a filtration of simplicial complexes is generated using values of the filtration parameter = 0.05, 0.10, 0.20. (D) For each , the average β 0 is computed for both the test sample (t , blue) and the control sample (c , red) and plotted into one single graph from which the test statistic S exp is calculated. The p-value is calculated using a permutation test and corrected using the false discovery rate (FDR), because of the multiple sections being tested.
For a group of patients of size m, TAaCGH calculates the value of β 0 for each patient i and the value of (denoted by β 0 (i, )) and then computes < β 0 ( ) >= 1 m β 0 (i, ) for i = 1, . . . , m, the average value of β 0 across all patients in the population. Since one obtains one < β 0 ( ) > for each , this average value naturally defines a function on , which we denote by < β 0 >. Based on the construction described, two populations of patients can be compared by identifying significant differences between their associated < β 0 > curves, which represent copy number changes present at one population and not the other. Figure 2D shows examples of < β 0 > curves for two samples: a test sample (in blue) containing profiles similar to the sample profile with the aberration and a control sample (in red) with no aberration at that particular location. The shape of the < β 0 > curves can be easily interpreted. For very small distances, every data point will contribute one component. As the distance increases, points that are at a distance less than connect, decreasing the number of components. Eventually, for a sufficiently large value of , all points connect to form a single connected component.
To test statistically-significant differences between the test and control < β 0 > curves, we used the sum of the squares of the differences in average < β 0 > across all values of , i.e., S exp = (t − c ) 2 for = 0, . . . , K, where t and c are the average number of connected components for the test set and the control set, respectively, and where K is the smallest number, such that t = c = 1. The null hypothesis tested was S exp = 0. In other words, there was no difference between the test and control < β 0 > curves. We defined the referent distribution using a standard permutation test between genotypes (i.e., aCGH profiles) and phenotypes (i.e., tumor subtype). Since this selection of p-values assigns one p-value per section, we corrected the final p-value using the false discovery rate (FDR) [50,51].
When analyzing tumor data, the test population consisted of a specific subtype, and the control population consisted of the remaining subtypes. Hence, both populations, test and control, had aberrations. In some cases, we found that the control < β 0 > curve had larger values than the test < β 0 > curve, suggesting that the control set had more CNAs at that particular location. These cases were not considered in our analysis, since they provided information of the control and not the test dataset.

Determining Significance of Specific Clones
TAaCGH determines a chromosome section that contains significant changes, but it does not identify specific clones with significant copy number changes or whether the change is an amplification or a deletion. To narrow down the search for clones with significant copy number changes and to identify whether copy number changes were gains or losses, we compared the mean copy number value at each clone between the control and the test population. Significance was assessed using a permutation test. Since both the test and control populations have CNAs, sometimes at identical locations, some chromosome arms were identified as significant using TAaCGH, but no clones were found significant. These few regions were still considered as significant, but classified as undetermined.

Detection of Full-Length Arm/Chromosome Section Aberrations
The topological approach of TAaCGH is designed to measure relative changes in copy number between a clone and its neighbors and, therefore, does not account for large-scale chromosome aberrations, such as full arm amplifications or deletions. To detect large-scale chromosome aberrations, we extended the topological method by measuring significant displacements of the center of masses of the point clouds. More specifically, the center of mass of the point cloud associated with each chromosome arm was calculated for each patient and averaged over all patients belonging to any given category. A p-value was assigned to the difference between the average value for the center of masses for both populations using the same permutation approach described above. We not only tested for significant differences between test and control, but also for significant displacements of the center of masses of the test population from the origin. This second test allowed us to drop those cases where the significant displacement was driven by the control population.

Validation of the Experimental Results
We took three different approaches to validate our findings: (1) we compared our findings to those reported in the original publication [33], using SIRAC [21], and with other related publications [31,36,52]; (2) those that were found in our study, but not in the original paper [33], were tested in a second dataset [32]; and (3) we performed an independent analysis of the original data using the program GISTIC [13]. To apply GISTIC to the dataset analyzed in this study, we first segmented each profile using circular binary segmentation [53]. After segmentation, GISTIC found the aberrations per patient, and we computed the percentage of patients of a given subtype with each aberration. Since the sample size for some subtypes was small, we reported only those aberrations that were present in at least 35% of the patients (the full table of aberrations detected by GISTIC can be found in the Supplementary Material). While we expected to have an overall agreement with GISTIC, we also expected GISTIC to detect extra aberrations, because TAaCGH is designed to detect CNAs that are specific to a given subtype.

Results and Discussion
In the following section, we present our results using TAaCGH. Simulation results, obtained using the methods described in Section 2, are all presented in Section 3.1, and analysis of aCGH data are presented in Section 3.2.

Simulation Results
We performed simulation studies to optimize the value of the parameters in TAaCGH. First, we determined the size of the window (dimension of the point cloud) for analyzing the data. Second, we estimated the sensitivity and specificity of TAaCGH for a fixed window size, and third, we estimated the length of each chromosome section to be analyzed (green bars in Figure 2A). Lastly, we tested the performance of the TAaCGH of regions containing aberrations in the test and the control sets.

Window Size
First, we investigated the role of the window size (or, in other words, the dimension in which the point cloud is embedded). The number of clones in the simulated section was 50 (instead of 100); the length of the aberrations (i.e., the number of clones in the aberration) λ = 2, 3, 5, 10, 20; the mean value of the aberration µ = −1, 1 and 0.6; and the standard deviation σ = 0.2 and 0.5. We considered all possible combinations of (λ, µ, σ) for window sizes D = 2, 5, 10, 15, 20, 35, 50 and obtained a p-value for each experiment. Each experiment was repeated 84 times. The vectors of p-values obtained by combining all of this information were used to compute correlations across dimensions. Table 1 shows the results. We observed that all values were highly correlated (≥0.84). Furthermore, all dimensions were consistent in their significance assignments (results not shown). We concluded that using D = 2 was enough to detect aberrations and performed the remainder of the studies at this dimension.

. Sensitivity and Specificity of TaACGH
To test the sensitivity and specificity of our method, we performed simulations analyzing the parameters that define the aberrations. Sensitivity was estimated with aberration parameters µ = −1, 0.6, 1, σ = 0.5 and λ = 2, 3, 5, 10, 20, 50, 75. Specificity was estimated by simulating cases and controls with µ = 0 and σ = 0.5 and obtained a 100% success rate. Each experiment was repeated at least 20 times for each combination of parameters; p-values were corrected by FDR. Results for sensitivity are shown in Figure 3. This figure shows that TAaCGH has excellent sensitivity when a segment has three or more consecutive copy number changes for both µ = 1 and µ = −1. This value sharply decreases when µ = 0.6 and λ < 5.

Size of the Chromosome Section
Next, we analyzed the effects of the size of the section under analysis. One expects that very large sections will produce poor results, since different aberrations may form topologically-indistinguishable point clouds. For example, two sections with one amplification each with the same mean, but at different locations produce identical point clouds. On the other hand, clouds with a small number of points are not expected to be very informative. We therefore computed the sensitivity and specificity when the point clouds were made of 20, 50, 80 and 100 points and the following parameters D = 2, µ = 1, σ = 0.5 and λ = 2. Each experiment was repeated 100 times. We observed that the larger the point cloud, the worse the sensitivity decreasing from 100% for 20 points to 62% for 100 points. Specificity was 100% for all cases. Figure 3. Sensitivity of TAaCGH using simulations. The chart shows the sensitivity of TAaCGH to the length and the mean value of the aberration λ and µ, respectively. Each row represents the different values of λ, and each color represents a different value of µ.

Performance of TAaCGH when Both Control and Test Population Have Overlapping Aberrations
In the last simulation study, we analyzed the performance of TAaCGH when both the control and the test set had a CNA at the same location. We considered a total of 60 patients in each category with fixed section size (= 20), standard deviation σ = 0.5 and dimension D = 2. The values of µ and λ ranged from {0.6, 1, −1} to {5, 10, 15}, respectively. For clarity, we denoted µ c and µ t the mean values of the CNA in the control and test group and by λ c and λ t the values of the length of the CNA for both sets. Results of our simulations are shown in Figure 4.
As illustrated in Figure 4A, TAaCGH identified the aberration in the test set in almost all cases when µ t > µ c , but never when µ c > µ t , since those cases produced < β 0 > curves in which the control set had higher values than the test set. When µ t = µ c , the performance was poor, as indicated in the bar chart on Figure 4B, almost independently of the length of the aberration.

Same mean for test and control
. Sensitivity and specificity of TAaCGH on sections with copy number aberrations (CNAs) in both the test and control groups. The chart shows the sensitivity of TAaCGH to the length and the mean value of the aberrations λ and µ, respectively. (A) Sensitivity when the mean for the test (µ t ) and the mean for the control (µ c ) are different. If the mean for the test group is larger than the one from the control, sensitivity was 98.3%. If, on the other hand, the mean from the control was larger than the test group, TAaCGH did not detect the aberration in the test group; (B) Poor sensitivity when both the test group and the control group have an aberration in the section with the same mean. When the length for the aberration in the test group (λ t ) has a medium size, that is 25% < λ t < 75%, from the size of the section, the sensitivity is 27.2%; otherwise, it will be close to 0%.

Results for Breast Cancer Subtypes
Samples were divided into subtypes; each subtype was considered separately as a test set, using the remaining subtypes as the control. In our analysis, chromosome arms were subdivided into overlapping sections; each section contained 20 clones, and any two consecutive clones overlapped 10 clones (Figure 2A). < β 0 > curves were calculated for the test and the control populations for a window size of n = 2. The obtained p-values were then corrected for multiple testing using FDR ( Figure 2D). Results obtained for the entire study are shown in Figures 5 and 6 and in Supplementary Files S1-S8. Each panel in Figures 5 and 6 shows the specific subtype with the location of the significant aberrations found by TAaCCH, SIRAC and GISTIC. Green entries mean amplifications, blue deletions and grey undetermined. Aberrations validated on an independent dataset are also color-coded in grey.
Supplementary Files S1-S6 show the statistics for each of the subtypes; S7 shows the results of the analysis using GISTIC; and S8 shows patients that were removed from the control set due to their large copy number values (i.e., outliers).  Figure 6. Summary results for basal-like subtype. The two panels show significant aberrations found by TAaCGH, SIRAC or GISTIC. Only results for TAaCGH that were validated in an independent dataset are shown. The frequency cut-off for GISTIC was 35%.
Amplifications are denoted in green, deletions in blue and undetermined in grey. Sections with both colors (blue and green) contained combinations of amplifications and deletions. Arms validated using TAaCGH on a second dataset [32] are also color coded in grey. * Significance when outliers were removed from the control set.

Analysis of Luminal Subtypes
We analyzed Luminal A and B subtypes separately. The Luminal A subtype clinically is associated with the most favorable disease prognosis among the molecular subtypes. It is commonly estimated by pathologic characteristics, namely the estrogen receptor (ER) +, progesterone receptor (PR) + and ERBB2/HER2 (−), with low proliferation and with a low number of chromosome aberrations. Commonly-observed aberrations in the Luminal A subtype include 1q, 8q, 8p, 11q gain, 16p and 16q loss [31,33,36,52,54]. Our topological analysis found a single significant region 11q22.1 to q23.2. Within this region, only the clone at position 100, 641, 187 was significantly amplified. Figure 7A shows the corresponding < β 0 > curves and 7B shows an example of a Luminal A aberrant profile at 11q.
Analysis of the displacement of the centers of mass for whole chromosome arms found 1q, 16p and 16q to be significant. Figure 7C shows the box plots for the displacements of the center of masses for chromosome arm 16q (for the Luminal A subtype and for the control set) together with a representative profile of a whole chromosome arm deletion for 16q ( Figure 7D). The three arms 1q, 16p and 16q were significant in the original study by Horlings, as well as in many other studies [32,35,36,52,55,56] and were also validated when applying TAaCGH to the data published in [32]. Interestingly GISTIC identified CNAs in 1q and 11q, as well as CNAs in 8p, 8q and 13q, but failed to identify 16p and 16q. Based on our simulations, we expected TAaCGH not to be able to identify all aberrations detected by GISTIC, since they were common to different subtypes. In this particular case, 8p was common to 51% of all patients across subtypes; 8q was common to 65%; and 13q was common to 63% (see Supplementary File S7).   Epsilon Log2 Ratio Next, we analyzed the Luminal B subtype. Luminal B patients are characterized by ER+, HER2− or HER2+, but have higher proliferation rates than Luminal A. Luminal B subtype cancers have generally worse prognosis than Luminal A. CNAs commonly observed in Luminal B patients include gains of 1q, 8p12 to p11, 8q, 11q13 to q14, 17q and 20q and losses in 1p, 8p, 13q, 16q, 17p and 22q (reviewed in [29,54]; see also [31,32,36,57,58]). Our topological analysis did not find any significant aberrations associated with this subtype. The displacement of the center of masses found 12q, but this arm was not confirmed on the validation dataset. A deeper analysis of the Luminal B set revealed that the initial significance was driven by Patient 378 with extremely large copy number values (see Supplementary File S6) and by a small cohort of Luminal B patients (n = 12). Supporting this conclusion is the fact that we did not find any significant aberrations after removing Patient 378.

Centers of mass for 16q
We performed several tests to understand why TAaCGH did not find the regions 1p31.3, 8p21.2 to p23.1 or 17q23.2 reported in the original study [33] and in other studies [31,32,36]. Since Luminal B is known to be a rather heterogeneous subtype and some of its CNAs may be shared across different subtypes [54], we hypothesized that some of these regions could be aberrant in more than one subtype. We therefore tested if the removal of specific subtypes from the control set would change our p-values. Upon removing the basal-like subtype from the control set, we found the significance of the regions 1p36.32 to p31.1, 4q24 to q27, 8p23.3 to p22, 8p22 to p11.1, 8q24.11 to q24.3, 9p24.3 to p21.1, 9q13 to q22.32, 9q31.1 to q33.1, 13q12.2 to q21.1, 13q31.1, 13q32.2, 21q11.2 to q22.3. TAaCGH identified specific amplified/deleted clones for the indicated significant regions, except for chromosome arms 4q, 8q and 21q, suggesting that these arms contain heterogeneous regions of CNAs that are common to several subtypes. Since 8p23.1 to p21.2 and 1p31.3 were reported in [33], they were not analyzed any further. Figure 8 shows the < β 0 > curves and patient profiles for regions 9p24.3 to p22.3 ((A) and (B)) and 13q12.2 to q21.1 ((C) and (D)). Figure 8B,D shows representative profiles that contain the significant CNAs reported in Table 2. To test whether these newly-identified regions were specific to [33], we performed a validation test using the dataset published in [32] and also compared it to those results obtained by GISTIC. All of the regions were validated in [32], although in some cases, small fragments within significant regions were not validated. This effect was most likely due to the higher resolution of the array. For instance within 1p36.32 to 31.1 we validated regions 1p36.32 to p36.22, 1p36.21 to p36.11, 1p35.3 to p35.2, 1p35.1 to p34.3, 1p34.2 to p31.1. Most of the regions, or regions in close proximity, not reported by [33], but found in our study, have been reported in other studies. For instance, 8p22 to p11, 8q, 9p, 9q, 13q and 21q have been reported as either focal or whole arm aberrations in [31,32,36,38,45,[59][60][61][62][63]. Significant clones for 8p are shown in Supplementary File S6. In agreement with TAaCGH, GISTIC also detected 8p, 8q and 13q (all common to more than 75% of the patients in the Luminal B subtype), but failed to detect 1p, 4q, 9p, 9q, 21q. On the other hand, GISTIC detected 1q, 3p, 3q, 6q, 11q, 17q and 18q. As expected, TAaCGH did not detect these CNAs, because they were common to the test and control set. For instance, 1q23.3 was common to 68%, and 1q41 was common to 60% of the patients across all subtypes. The remaining chromosome arms were shared by different amounts of patients, ranging from 25% for 6q to 43% for 3p. It may seem that 25% is a small number of patients; however, when we looked at the distribution of patients across subtypes, we found that half of these patients were in the Luminal B set (test set) and the other half in the HER2+ (control set; see the distribution of aberrations in the Supplementary File S7).   The positions of the significant clones within the identified significant regions were identified next and are presented in Table 2.
Lastly, we removed the HER2+ subtype from the control dataset, but no significant aberrations were detected. In conclusion, upon removal of the basal-like dataset from the control group, our study found all regions reported in [33], except 17q23.2, and found 10 other regions that were also validated in [32], three of which were validated by GISTIC. Next, we analyzed the ERBB2/HER2/NEU-enriched subtype. In HER2+ patients, overexpression of HER2 is commonly regulated by an amplification of the chromosome region containing the gene ERBB2 [64,65]. Only regions 17q11.1 to q12, 17q12 to q21.31 and 17q.21.31 to q22 were significant, and the corresponding < β 0 > curves are shown in Figures 9A-9C. Our study, in agreement the the original study [33] and others [32,36,66], found the location of ERBB2 (cytobands 17q11.1 to q12). Furthermore, in agreement with the study by Horlings, we found a region extending beyond ERBB2 and containing cytobands 17q21.2 to q22. Significant clones were located between base pair positions 37,258,265 to 38,428,492 and 48,120,796 to 48,817,562. Figure 9D shows the profile of a patient with amplifications at the significant regions.
These results were also validated in [32]. In particular, our validation study confirmed a significant amplification encompassing regions 17q12 to 17q21.2. GISTIC was consistent with these findings and also found that more than 35% of the patients in the HER2+ subtype had aberrations in 1q, 3p, 8p and 13q. Similarly to our previous remarks, these CNAs were not detected by TAaCGH, because they were common to a large percentage of the patients in the study (>43%). Epsilon Epsilon       Log2 Ratio

Results for the Basal-Like Subtype
The basal-like subtype is the most heterogeneous subtype and includes those that are termed triple negative, indicative of the absence of ER, PR and HER2 expression. This subtype is generally associated with the worst prognosis of the subtypes, perhaps in part due to lack of targeted therapies. Consistent with this heterogeneity, we found the basal-like subtype to have the highest number of CNAs in a total of 29 different regions. We present the significant sections of the genome found in our study according to our validation results i. CNAs in agreement with those reported in the original study We found the statistical significance of all arms reported in [33] (i.e., 4p, 5q, 6p, 10p, 10q and 15q) with the exception of 8q and 12q. From these, the only arms that were not significant by the center of masses were 12q and 15q. Our significant regions did not always match the regions reported in [33]. These discrepancies between both studies were either because of the updated position of the clones in our study or because TAaCGH found regions containing those reported in [33]. For example, we found the significance of the region 4p15.1 to p11 instead of the reported region 4p15.31. Region 4p15.31, however, became significant when the clones were placed following Horlings' original position. Another example was the loss of 5q12.3 to q13.2, where we found an overlapping region expanding from 5q11.1 to 5q13.1. We did not detect 8q initially. The whole 8q chromosome arm became significant (using the center of masses) upon removal of two Luminal patients (110 and 302) with profiles that were clearly different from the rest of patients in the subtype. Similarly, 12q became significant (with TAaCGH) upon removal of the Luminal B Patient 378 (the patient already identified as an outlier in the Luminal B study). Our study using GISTIC found 4p, 5q, 10p, 10q, but failed to identify 6p, 15q.
ii. CNAs not reported in the original study, but validated in [32] using TAaCGH We also detected regions that were validated in [32], but that were not reported in the original study [33]. These regions were 1p22.2 to p12, 1p36.32 to p31.1, 2p15 to p11.2 and most segments in 14q: 14q12 to q21.3 and 14q24.1 to q32.33. This region in 14q was large enough that the significance was also reflected in the analysis by the center of masses. Significant sections contained the cytobands 14q32 to q33 reported by [31,67], but partially missed the cytobands 14q22 to q23 reported by [32]. A detailed description of the significant clones in the basal-like subtype are presented in Table 3 and in Supplementary File S6. Figure 10 shows examples of the < β 0 > curves for chromosome arms 2p and 14q ((A) and (C)), a representative profile for chromosome 2p (B) and the displacement for the center of masses for 14q for the basal-like subtype (D). None of these regions were detected by our analysis using GISTIC.
iii. New CNAs not reported in the original study, not validated in [32], but confirmed by GISTIC Some regions were not validated in [32], but were detected by GISTIC. We report these regions separately, because they have also been reported in other studies; hence, we do not believe they are an artifact of the data.
1. Chromosome arm 1q: We found the region 1q23.1 to 31.1 to be aberrant, and GISTIC confirmed it to be an amplification. This region was large enough that the changes were also detected in the study using the center of masses. This region of the genome was previously reported in other studies [31,32,36]. 2. Chromosome arm 3p: Two regions in 3p were significant: 3p22.1 to p11.2 and 3p26.3 to p23.
Region 3p22.1 to p11.2 has been reported to be a loss in a number of studies, including [32,[68][69][70]. Additionally, we found a gain in 3p26.3 to p23. 3. Chromosome arm 3q: We found an amplification of the whole arm, while GISTIC found region 3q27.2. The gain of 3q is characteristic of BRCA1 deficiency in sporadic tumors, as well as in hereditary tumors (e.g., [71]). 4. Chromosome arm 6q: Three sections out of nine were found amplified in chromosome arm 6q.
These three regions expanded cytobands 6q24.1 to q27 and contain the estrogen receptor gene ERS1 located at 6q25.2. This finding was also detected in our study by the center of masses [69]. 5. Chromosome arm 12p: The region 12p13.3 was found to be amplified using our topological analysis, and the entire arm was also detected by the displacement of the center of masses. GISTIC detected a downstream region 12p33 to be amplified.
6. Chromosome arm 13q: Two main sections were found aberrant in the chromosome arm 13q. The first one expanding 13q12.2 to q31.2 and the second 13q31.2 to q34. We were unable to identify whether 13q12.2 to q31.2 was an amplification or a deletion; however, GISTIC identified a deletion in 13q14.11 in 81% of the patients. Additionally TAaCGH found an amplification in 13q31.2 to q34. Amplifications in chromosome 13q have been found in multiple subtypes [36] and more specifically in cytokeratin 14 (CK14) positive tumors, 25% of which are basal-like carcinomas [72]. 7. Chromosome arm 18q: A section extending 18q11.1 to q21.33 was significant, but TAaCGH was unable to identify whether it was an amplification or a deletion, suggesting an heterogeneous combinations of amplifications and deletions in the region across subtypes. GISTIC, on the other hand, identified a deletion in 18q12.2.    iv. New CNAs not reported in the original study, not validated in [32] using TAaCGH, but in agreement with other studies Three chromosome arms were neither identified by GISTIC, nor validated in [32]. These arms, however, have been reported previously in other works.
1. Chromosome arm 4q: We found most of 4q to be significant, with the exception of the centromere near regions 4q11 to q13.3. This finding is in agreement with [32,69], who reported a loss of 4q31 to q35. The aberration in 4q was large enough to also be detected by the center of masses. 2. Chromosome arm 9p: Cytobands from 9p24.3 to 9p21.3 were found to be significant. Gains were reported in [63,73]. 3. Chromosome arm Xp. The section containing Xp22.33 to 11.21, which contained all clones in the array, was found significant in [72].
v. New CNAs identified by TAaCGH, but not validated in [32] Some regions were found to be significant only on the original dataset. These included section 5p15.33 to p12 and were also significant when considering the whole chromosome arm 5p and sections 9q21.13 to q22.32 and 9q32 to q34.3. Since these regions were not validated and have scarcely been investigated previously, we cataloged them as artifacts of the data.
vi. CNAS identified by GISTIC alone GISTIC identified four chromosome arms that neither TAaCGH, nor SIRAC identified. There regions were an amplification in 7q34, a deletion in 8p23.2, a deletion in 11q24.3 and an amplification in 17q24.3. As discussed earlier, these regions were relatively common across different subtypes (see Supplementary File S7).

Conclusions
Array CGH provides an unparalleled opportunity to characterize disease-associated CNAs. Identification of these aberrations, however, is a difficult task, due to the heterogeneity of the diseases and the noise inherent to the microarray technologies. Here, we have presented a method called topological analysis of array CGH (TAaCGH), which complements currently-available methods. Other topological methods have been used in the identification of breast cancer subtypes using gene expression [74]; however, TAaCGH is, to our knowledge, the only supervised method that uses topological techniques to identify regions of copy number changes. In comparison to other methods that identify CNAs, TAaCGH incorporates a multi-resolution segmentation approach, modulated by the filtration parameter, and performs an association test between the topological properties of the point cloud and the phenotype under study.
TAaCGH is designed to analyze any type of time series provided that it is made of real-valued data. In previous works, we applied preliminary versions of TAaCGH to a different breast cancer aCGH dataset [22] and to a gene expression dataset [48], and in the future, we intend to extend it to other genomic data (see, for instance, [75]). Applications of TAaCGH to SNPs or sequencing data remain to be explored. In this study, we have extended our previous work by analyzing the statistical properties of TAaCGH, developing a new method to optimize the parameters in the program (specificity and sensitivity, dimensionality, size of the genomic section to be analyzed, comparison of test and control with CNAs at identical locations), identifying amplifications and deletions of specific clones within significant < β 0 > sections and incorporating the analysis of the center of masses for identifying whole arm amplifications/deletions.
We found almost all aberrations reported in the initial study [33] and 31 more regions that were missed in the original study. We validated many of the regions by testing TAaCGH on a second dataset [32] and by comparing our results with those obtained by GISTIC. Although the gene expression data in [33] was not made available to us, we were able to find some possible significance to the regions found using the database Catalogue of Somatic Mutations in Cancer (COSMIC) [76]. For instance, we found 11q22.1 to 23.2 in Luminal A. This aberration was not found in the original study by Horlings, was not validated in the dataset of [32], but was confirmed by GISTIC. Interestingly, 11q22.1 to q23.2 corresponds to the position of the progesterone receptor (PR) whose overexpression is commonly associated with Luminal A tumors. Supporting this finding, although not conclusive, we found a significant association between patients with an amplification in 11q22.1 to 23.2 and the reported PR status (Fischer exact test p = 0.03). This result suggests that the PR gene can be regulated by changes in copy number.
We found significant aberrations on the Luminal B subtype only upon removal of the basal-like subtype from the control set. Besides those aberrations that were reported in the original study (i.e., 1p36.32 to 31.1 and 8p23.3 to 23.1) [33] we also found the following: (1) Region 4q24 to q27 for which we did not find a specific amplification or deletion common to all patients, suggesting an heterogeneous pattern of amplification and deletions across the profiles. Inspection of the profiles revealed a large deletion in some patients [77] combined with a focal gain between positions 112,097,383 and 116,102,599. Analysis of the COSMIC database of these cytobands found gene U GT 8 associated with malignancy and lung metastasis [78,79]. (2) Region 8p12 contains genes whose loss has been associated with breast cancer progression (KAT6A, PURG, WRN, NRG1) [80][81][82][83]. (3) 9p was driven by a combination of deletions and amplifications. We found a gain at 6,004,718 in 9p24.1, a region that contains multiple proto-oncogenes related to breast cancer (i.e., GASC1 UHRF2, KIAA1432 and C9orf123) [63], as well as losses in positions 9,668,611, a region that contains the single gene PTPRD that has been associated with poor prognosis and metastasis in cancer [84,85]. Furthermore, the region 9p21.1 to q23 has been reported to be lost in breast cancer [86]. (4) We also found large deleted regions in chromosomes 9q and 13q, which contain multiple cancer genes, together with an amplification in 9q31.1. This amplification contains the gene SMC2, which has been associated with poor prognosis [87]. (5) A gain at 43,635,239 in 21q22.3 was also found. Our study did not find 17q. We believe that this was the result of having a small sample size for the Luminal B subtype and of having several patients in the control group with aberrations in the same region.
Basal-like tumors revealed a wide variety of aberrations, and in our study, we found all aberrations reported in the original study (chromosome arms 8q and 12q were found by removing three patients with very high copy number values; see Supplementary File S8) and 19 more aberrations. Out of these 19 aberrations, seven were also confirmed by GISTIC, and three were validated in a second independent dataset. Most of the others had been reported in other studies. These three new CNAs were found in chromosome arms 1p, 2p and 14q and are described in Table 3. Most significantly, we found intermittent regions of gains and losses between 1p36.32 and 1p13.1, a gain of 2p15 to p11.2 and losses of 14q12 to q21.3 and 14q24.3 to q32.22. We were unable to obtain any meaningful information from the COSMIC database, however, because these regions are large and dense in cancer genes.
In conclusion, topological approaches provide an alternative method for data analysis, and in this work, we have shown how it can help uncover chromosome aberrations in aCGH data. One important feature of this topological approach is that it can be extended in multiple directions. For instance, the work of Perea and Harer [23] and our own work [88,89] suggest that the number of two-dimensional holes in the data (denoted by β 1 ) can be used to identify periodic patterns in the data. We are currently working on such an analysis, but are unable to provide any results yet, since a new statistical framework needs to be developed. Our preliminary studies suggest that certain co-occurring aberrations can be detected using this topological invariant. Other possible extensions include different strategies in the generation of the point cloud (instead of a delay time embedding algorithm) or the incorporation of non-euclidean measures in the definition of the point cloud.