Functionally Coherent Transcription Factor Target Networks Illuminate Control of Epithelial Remodelling and Oncogenic Notch

Background Cell identity is governed by gene expression, regulated by Transcription Factor (TF) binding at cis-regulatory modules. Decoding the relationship TF binding patterns and the regulation of cognate target genes is nontrivial, remaining a fundamental limitation in understanding cell decision-making mechanisms. Identification of TF physical binding that is biologically ‘neutral’ is a current challenge. We studied cell identity in the context of Epithelial to Mesenchymal Transition (EMT), a cell programme fundamental for normal embryonic development that contributes to tumour progression and fibrosis. Results We developed the NetNC software for discovery of functionally coherent TF targets. NetNC was applied to analyse gene regulation by the EMT TFs Snail, Twist in early embryogenesis and also to modENCODE ‘HOT’ regions. Predicted neutral binding accounted for 50% to ≥80% of candidate target genes assigned from significant binding peaks. Novel gene functions and network modules were identified, including regulation of chromatin organisation and crosstalk with notch signalling. Orthologues of predicted TF targets discriminated breast cancer molecular subtypes and NetNC analysis predicted new gene functions; for example, evidencing networks that reshape Waddington’s landscape during EMT-like phenotype switching. Predicted invasion roles for SNX29, ATG3, UNK and IRX4 were validated using a tractable cell model. Conclusions We found extensive neutral TF binding across the nine datasets examined and showed that NetNC performs well in identifying functionally coherent targets. HOT regions had comparatively high functional coherence. Our results illuminate conserved molecular networks that regulate epithelial remodelling in development and disease, with potential implications for precision medicine.


INTRODUCTION
Transcriptional regulatory factors (TFs) govern gene expression, which is a crucial determinant of phenotype. Mapping transcriptional regulatory networks is an attractive approach to understand the molecular mechanisms underpinning both normal biology and disease (Rhee et al., 2014;Shlyueva et al., 2014;Stampfel et al., 2015). TF action is controlled in multiple ways; including proteinprotein interactions, DNA sequence affinity, 3D chromatin conformation, post-translational modifications and the processes required for TF delivery to the nucleus (Khoueiry et al., 2017;Rhee et al., 2014;Zabidi and Stark, 2016). A complex interplay of mechanisms influences TF specificity across different biological contexts and genome-scale assignment of TFs to individual genes is challenging (Khoueiry et al., 2017;Shlyueva et al., 2014;Wilczynski and Furlong, 2010).
TF binding sites may be determined using chromatin immunoprecipitation followed by sequencing (ChIP-seq) or microarray (ChIP-chip). These and related methods (e.g. ChIP-exo, DamID) have revealed a substantial proportion of statistically significant 'neutral' TF binding, that has apparently no effect on transcription from assigned target genes (Biggin, 2011;Li et al., 2008;Ozdemir et al., 2011;Shlyueva et al., 2014). Genomic regions that bind large numbers of TFs are termed Highly Occupied Target (HOT) regions (Roy et al., 2010), are enriched for disease SNPs and can function as developmental enhancers (Kvon et al., 2012;Li et al., 2015). However, a considerable proportion of individual TF binding events at HOT regions may have little effect on gene expression and association with chromatin accessibility suggests non-canonical regulatory function such as sequestration of TFs or in 3D genome organisation (Montavon et al., 2011;Moorman et al., 2006) as well as possible technical artefacts (Teytelman et al., 2013). Apparent neutral binding events may also have subtle functions; for example in combinatorial contextspecific regulation or in buffering noise (Cannavò et al., 2016;Stampfel et al., 2015). Identification of bona fide, functional TF target genes remains a major obstacle in understanding the regulatory Figure 1. Overview of the NetNC algorithm. NetNC input data may be candidate TF target genes and a functional gene network (top, left). However, NetNC may be applied to any gene list and reference network. Hypergeometric Mutual Clustering (HMC) p-values are calculated for candidate TF target genes (top, middle); the node numbers and colours in the HMC graph correspond to those in the contingency table. HMC p-values are then employed in either i) a node-centric analysis mode (NetNC-FBT, right top) or ii) an edge-centric mode (NetNC-FTI) using global False Discovery Rate (pFDR, middle) followed by iterative minimum cut (bottom). NetNC can estimate local FDR (lcFDR) to predict the proportion of neutrally bound candidate target genes (left). NetNC can produce pathway-like clusters and also biologically coherent node lists for which edges may be taken using a standard FDR or Family Wise Error Rate (FWER) threshold on the HMC p-values (right). in columns, and sixteen %STNG values were analysed (5% to 80%, x-axis). NetNC performed best on the data examined with typically lower FPR and higher MCC values. Error bars reflect 95% confidence intervals calculated from quantiles of SNTG resamples. Also see Table S2.
PNBP was similar for sites derived from either the union or intersection of two Twist antibodies, although the NetNC-FTI method found a higher number of functional targets for the intersection of antibodies (30.5% (116/334) vs 23% (424/1848)). Hits identified by multiple antibodies may be technically more robust due to reduced off-target binding (Sandmann et al., 2007). However, taking the union of candidate binding sites could eliminate false negatives arising from epitope steric occlusion due to protein interactions. Similar PNBP for either the intersection or the union of Twist antibodies suggests that, despite expected higher technical specificity, the given based on 'Functional Target Identification' (NetNC-FTI) and mean local FDR (NetNC-lcFDR) calibrated against datasets with a known proportion of resampled Synthetic Neutral Target Genes (SNTG) described in Methods section 4.2.3. The above datasets correspond to the following developmental stages: 2-4h stages 4-9 (except '2-4h_intersect datasets which were stages 5-7 (Sandmann et al., 2007)); 2-3h stages 4-6; 1-3h stages 2-6; 4-6h stages 8-9 (Sandmann et al., 2007); 0-12h stages 1-15; gastrulation occurs at stage 6 (Campos-Ortega and Hartenstein, 1997). Also see Figure S2.  recovered at lcFDR<0.05 or pFDR<0.05 ( Figure 3). Even datasets with high PNBP had candidate target genes that passed stringent NetNC FDR thresholds; for example, sna_2-3h_union, twi_2-3h_union respectively had the highest and second-highest proportion of candidate targets at lcFDR<0.05 ( Figure 3B). We found no evidence for benefit in using RNA polymerase binding data to guide allocation of peaks to candidate target genes (datasets sna_2-3h_union, twi_2-3h_union).
The twi_2-4h_Toll 10b , sna_2-4h_Toll 10b datasets had a relatively low peak threshold (two-fold enrichment), which may contribute to the high PNBP for sna_2-4h_Toll 10b . We note that PNBP values might systematically overestimate neutral binding because some functional targets could be missed; for example due to errors in assigning enhancer binding to target genes and in bona fide regulation of genes that have few DroFN edges with other candidate targets. Additionally, calibration of lcFDR values against KEGG synthetic data might influence neutral binding estimates, due to potential differences in network properties between TF targets and KEGG pathways.
ChIP peak intensity putatively correlates with functional binding, although some weak binding sites are functional (Biggin, 2011;Chen et al., 2013). NetNC NFCS values and ChIP peak enrichment scores were significantly correlated in 6/8 datasets (q<0.05, HOT regions not analysed).

Genome-scale functional transcription factor target networks
NetNC results offer a global representation of the mechanisms by which Snail and Twist exert tissue-specific regulation in early D. melanogaster embryogenesis (Figure 4, Figure S3, Supplemental File 1). Networks were annotated using GO and FlyBase (Ashburner et al., 2000;Gramates et al., 2017;Huang et al., 2009;Maere et al., 2005), revealing eleven biological groupings common to ≥4/9 TF_ALL datasets (Table S3). NetNC results were robust to subsampled input and, as expected, higher variance was found at lower subsampling rates (Table S4).
NetNC analysis found Snail and Twist regulation of multiple core cell processes that govern the global composition of the transcriptome and proteome (Figure 4, Figure S3). These processes included transcription, chromatin organisation, RNA splicing, translation and protein turnover. A 'Developmental Regulation Cluster' (DRC) was identified in every TF_ALL dataset and contained members of multiple key conserved morphogenetic pathways, including notch, wnt and fibroblast growth factor (FGF). Notch signalling modifiers from public data (Guruharsha et al., 2012) overlapped significantly with NetNC-FTI results for each TF_ALL dataset (q <0.05), including the DRC, chromatin organisation and mediator complex clusters ( Figure 4, Figure S3). Notch was an important control node across TF_ALL, with highest betweenness centrality in the DRC for three datasets and ranked among the top ten DRC genes for 8/9 datasets. Activation of Notch can result in diverse, context-specific transcriptional outputs and the mechanisms regulating this pleiotropy are not well understood (Bray, 2016;Guruharsha et al., 2012;Nowell and Radtke, 2017;Ntziachristos et al., 2014). Our results provided functional context for many Notch modifiers and proposed signalling crosstalk mechanisms in cell fate decisions driven by Snail and Twist, where regulation of modifiers may control the consequences of Notch activation. Crosstalk between Notch and twist for human orthology (bold node border) and Notch screen hits (triangular nodes). Orthologues assigned basal-like (BL) or normal-like centroids (NL) are shown in red, green respectively; otherwise, node colour indicates differential expression NL (blue) vs BL (orange) subtypes (q<0.05) or no annotation (grey). Also see Figure S3, Tables S3 to S6 and Supplemental File 1. Panel A: twi_2-3h_union. Predicted functional targets cover several areas of fundamental biochemistry including splicing, DNA replication, energy metabolism, translation and chromatin organisation.
Regulation of multiple conserved processes by Twist is consistent with the extensive cell changes required during mesoderm development. Clusters annotated predominantly to either NL or BL subtypes include mitochondrial translation (BL) and the proteasome (NL). These results predict novel Twist functions, for example in regulation of mushroom body neuroblast proliferation factors. or snail was previously shown in multiple systems, for example in adult myogenic progenitors (Bernard et al., 2010) and hypoxia-induced EMT (Sahlgren et al., 2008 ranking within the top ten DRC genes in six TF_ALL datasets and was twice top-ranked. Therefore Notch and wingless were key control points regulated by Snail, Twist in the mesoderm specification network. Thirteen DRC genes were present in ≥7 TF_ALL datasets (DRC-13, Table S5), and had established functions in the development of mesodermal derivatives such as muscle, the nervous system and heart (Baylies and Bate, 1996;Bernard et al., 2010;Bray, 2016;Chen et al., 1996;Lo et al., 2002;Trujillo et al., 2016;Xie et al., 2016). Supporting NetNC-FTI predictions, in situ hybridisation for DRC-13 indicated earliest expression in (presumptive) mesoderm at: stages 4-6 (wg, en, twi, N, htl, how), stages 7-8 (rib, pyd, mbc, abd-A) and stages 9-10 (pnt) (BDGP; Hammonds et al., Hartley et al., 1987;Tomancak et al., 2002). The remaining two DRC-13 genes had no evidence for mesodermal expression (fkh) or no data available (jar). However, fkh is essential for caudal visceral mesoderm development (Kusch and Reuter, 1999) and jar is expressed in midgut mesoderm (Millo and Bownes, 2007).
Six TF_ALL datasets had brain development clusters. Snail regulation of neural genes is consistent with its repression of ectodermal (neural) genes in the prospective mesoderm (Gilmour et al., 2017;Leptin, 1991;Wieschaus and Nüsslein-Volhard, 2016). Additionally, Snail is important for neurogenesis in fly and mammals (Ashraf and Ip, 2001;Zander et al., 2014). Therefore, binding to neural functional modules might reflect potentiation for rapid activation in combination with other transcription factors within neural developmental trajectories (Nevil et al., 2017;Sandmann et al., 2007). NetNC results predict novel Twist functions, for example in activation or repression of mushroom body neuroblast proliferation factors retinal homeobox, slender lobes, and taranis. The mushroom body is a prominent structure in the fly brain, important for olfactory learning and memory (Caron et al., 2013). Twist is typically a transcriptional activator (Gilmour et al., 2017) although may contribute to Snail's repressive activity (Lin et al., 2015). Indeed, Twist-related protein 1 repressed Cadherin-1 in breast cancers (Vesuna et al., 2008) Breast cancer subtype is characterised by differential expression of orthologous

Snail and Twist functional targets
Integration of NetNC predictions, Notch screens and breast cancer transcriptomes enabled analysis of conserved molecular networks that orchestrate epithelial remodelling in development and cancers. NetNC-FTI Snail and Twist targets included known cancer genes and also predicted novel drivers ( Figure 4, Figure S3, Tables S3, S5, S6). The fly genome is relatively tractable for network studies, while data availability (e.g. ChIP-chip, ChIP-seq, genetic screens) is enhanced by both considerable community resources and the relative ease of experimental manipulation (Mohr et al., 2014). Breast cancer intrinsic molecular subtypes with distinct clinical trajectories have been extensively validated and complement clinico-pathological parameters (Cejalvo et al., 2017;Sørlie et al., 2003). These subtypes are known as luminal-A, luminal-B, HER2-overexpressing, normallike and basal-like. While more recent studies have classified more subtypes, for example identifying ten groups (Curtis et al., 2012), the five subtypes employed in our analysis had been widely used, extensively validated, exhibited clear differences in prognosis, overlapped with subgroups defined using standard clinical markers (ESR1, HER2), and aligned with distinct treatment pathways (Cejalvo et al., 2017;Sørlie et al., 2003). NetNC-FTI networks for all nine TF_ALL datasets overlapped with known cancer pathways, including significant enrichment for Notch modifiers (q<0.05). Aberrant activation of Notch orthologues in breast cancers had been demonstrated, and linked with EMT-like signalling, particularly for basal-like and claudin-low subtypes (Barnawi et al., 2016;Ingthorsson et al., 2016;Stylianou et al., 2006). We hypothesised that orthologous genes from Snail and Twist functional targets would stratify breast cancers into clinically meaningful groups.
Basal-like tumours were characterised by EN1 and NOTCH1, aligning with previous findings (Barnawi et al., 2016;Beltran et al., 2014;Stylianou et al., 2006). Notch signalling modulation is a promising area for cancer therapy (Ntziachristos et al., 2014) and orthologues of Notch modifiers identified in our analysis provide a pool of candidates that could potentially inform development of companion diagnostics or combination therapies for agents targeting the notch pathway in basal-like breast cancers. Elevated ETV6 expression was also a feature of basal-like cancers, where copy number amplifications and recurrent gene fusions were previously reported (Adélaïde et al., 2007;Letessier et al., 2005). The Luminal A subtype (feature_LumA) had similarities with luminal B (feature_LumB 2 , ERBB3, MYO6) and normal-like (DOCK1, ERBB3, MYO6) tumours. High BMPR1B expression was a defining feature of the luminal A subtype, in agreement with oncogenic BMP signalling in luminal epithelia (Chapellier et al., 2015). BMP2 may be pleiotropic, promoting EMT characteristics in some contexts (Ma et al., 2005;Ren and Dijke, 2017). BMP2 expression was typically high in basal-like tumours and low in luminal cancers, aligning with BMP2 with low %P values and to samples from a single dataset (Popovici et al., 2010). Some genes were annotated to more than one feature and reciprocal patterns of gene expression were found. For example, BMPR1B, ERBB3 and MYO6 were strongly upregulated in feature LumA but downregulated in basal-like and HER2-overexpressing cancers.
Unexpectedly, feature NL (normal-like) had high expression of canonical EMT drivers, including SNAI2, TWIST and QKI. Some of the EMT genes in feature NL were also highly expressed in many basal-like tumours, while genes in feature Bas (NOTCH, SERTAD2) were upregulated in normal-like tumours. Also see Figure S4.
upregulation as a feature of the EMT programme in basal-like cancers. Several genes were highly expressed in both feature_LumB 1 and ESR1 negative subtypes (feature_ERneg), including ECT2, SNRPD1, SRSF2, CBX3; our data suggest that these genes might contribute to the worse survival outcomes for luminal B relative to luminal A cancers. (Sørlie et al., 2001(Sørlie et al., , 2003. SNRPD1 and SRSF2 function in splicing (Bermingham et al., 1995) which was linked to survival of multiple basal-like breast cancer cell lines (Chan et al., 2017). CBX3 and ECT2 were correlated with poor prognosis (Liang et al., 2017;Wang et al., 2018). These genes typically had low expression in Luminal A and normal-like subtypes. Feature_LoExp represents genes with low detection rates and tumours populating feature_LoExp are a mixture of subtypes, largely from a single study (Popovici et al., 2010). Key EMT genes (SNAI2, TWIST1, QKI) were assigned to the NL centroid and had highest relative expression in normal-like tumours (feature_NL, Figure 5). Feature_NL also included homeobox transcription factors (HOXA9, MEIS2) and a secreted cell migration guidance gene (SLIT2). Some genes had high expression in both normal-like and basal-like cancers, including QKI, which regulates circRNA formation in EMT (Conn et al., 2015), and the FZD1 wnt/β-catenin receptor. Moreover, genes in feature_Bas and feature_NL clustered together, reflecting expression similarities between normal-like and basal-like subtypes. EMT may confer stem-like cell properties (DiMeo et al., 2009;Mani et al., 2008;Schmidt et al., 2015) and our results were consistent with dedifferentiation or arrested differentiation due to activation of an EMT-like programme in NL cancers. Previous work found stem cell markers in normal-like cancers (Sieuwerts et al., 2009;Sørlie et al., 2001) and SNAI2 was critical for maintenance of mammary stem cells and linked with 20 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint a stem-like signature in breast cancers (Guo et al., 2012;Lawson et al., 2015). Cell-compositional effects, associated with high stromal content in NL tumours (Prat and Perou, 2011), could also explain EMT-like molecular characteristics. We also found many differences between normal-like and basal-like cancer subtypes ( Figure 4, Figure S3). Our results demonstrated that NetNC functional targets from fly mesoderm development capture clinically relevant molecular features of breast cancers and proposed novel candidate drivers of tumour progression.

Integrating NetNC functional target networks and breast cancer transcriptome profiling
Orthologous basal-like and normal-like genes were annotated onto NetNC-FTI networks, offering a new perspective on the molecular circuits controlling these different subtypes ( Figure 4, Figure S3).
We focussed on basal-like and normal-like cancers, which accounted for the large majority of networks studied and were prominent in centroid and heatmap analysis ( Figure 5, Figure S4). Furthermore, EMT had been shown to be important for basal-like breast cancer biology (Sarrió et al., 2008) and, interestingly, key EMT genes were assigned to the normal-like subtype. Splicing factors and components of the ribosome were associated with the normal-like subtype in three networks (twi_2-4h_intersect, twi-2-6h_intersect, twi_2-3h_union); additionally, the proteasome and proteasome regulatory subunits in the twi_2-3h_union network had many genes annotated to the normal-like subtype.
The sna_2-4h_Toll 10b 'RNA degradation and transcriptional regulation' cluster was exclusively annotated to the basal-like subtype and included HECA, which may function as a tumour suppressor (Lin et al., 2013) and an oncogene (Chien et al., 2006). HECA was also identified in NetNC-FTI analysis of twi_2-4h_intersect and twi_4-6h_intersect; these datasets had Twist binding at different, non-contiguous sites that were both assigned to hdc, the fly HECA orthologue. Hdc was a multifunctional Notch signalling modifier, including in cell survival 21 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint (Resende et al., 2017) and tracheal branching morphogenesis, activated by escargot (Steneberg et al., 1998). HECA was upregulated in basal-like relative to normal-like tumours (p<3.3x10 -23 ). Taken together, these data support participation of HECA in an EMT-like gene expression programme in basal-like breast cancers.
Chromatin organisation clusters frequently associated with basal-like annotations. For example, the twi_2-3h_union 'chromatin organisation and transcriptional regulation' cluster had six basal-like genes, including three Notch modifiers (ash1, tara, Bap111). These were orthologous to the ASH1L histone methyltransferase and candidate poor prognosis factor with copy number amplifications in basal-like tumours (Liu et al., 2014); the SERTAD2 bromodomain interacting oncogene and E2F activator (Cheong et al., 2009); and SMARCE1, a core subunit of the SWI/SNF chromatin remodelling complex that regulated ESR1, interacted with HIF1A signalling and potentiated breast cancer metastasis (García-Pedrero et al., 2006;Sethuraman et al., 2016;Sokol et al., 2017). Notch can promote EMT-like characteristics and mediated hypoxia-induced invasion in multiple cell lines (Sahlgren et al., 2008). Consistent with these studies, our work supported 22 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint conserved function for SMARCE1 in EMT-like signalling, both in mesoderm development and basal-like breast cancers, possibly downstream of NOTCH1 and through regulation of SWI/SNF targeting. Indeed, SWI/SNF controled chromatin switching in oral cancer EMT (Mohd-Sarip et al., 2017). Taranis, orthologous to SERTAD2, also functions to stabilise the expression of engrailed in regenerating tissue (Schuster and Smith-Bolton, 2015). The engrailed orthologue EN1 was the clearest single basal-like biomarker in the data examined ( Figure 5) and acts as a survival factor (Beltran et al., 2014). SERTAD2 could cooperate with EN1 in basal-like cancers; our results evidence coordinated expression of these two genes as part of a gene expression programme controlled by EMT TFs. Regulation of EN1, SERTAD2 within an EMT programme could harmonise previous results demonstrating key roles for both neural-specific and EMT TFs in basal-like breast cancers (Beltran et al., 2014;Sarrió et al., 2008). Therefore, we identify chromatin organisation factors downstream of Snail and Twist with orthologues that may control Notch output and breast cancer progression through a chromatin remodelling mechanism. Indeed, NetNC results predict components of feedback loops where EMT TFs regulate chromatin organisation genes that, in turn, may both reinforce and coordinate downstream stages of gene expression programmes for mesoderm development and cancer progression. Stages of the EMT programme were described elsewhere, reviewed in (Nieto et al., 2016); our results map networks that may control the remodelling of Waddington's landscape -identifying crosstalk between Snail, Twist, epigenetic modifiers and regulation of key developmental pathways (Hemberger et al., 2009). Dynamic interplay between successive cohorts of TFs and chromatin organisation factors could be an attractive mechanism to determine progress through and the ordering of steps in (partial) EMTs, consistent with 'metastable' intermediate stages (Nieto et al., 2016).

23
. CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint

Novel Twist and Snail functional targets influence invasion in a breast cancer model of EMT
NetNC results predicted new gene functions in EMT and cell invasion, including SNX29 (also known as RUNDC2A), ATG3, IRX4 and UNK. We investigated the functional and instructive role of these genes in an established invasion model (Dhasarathy et al., 2007). MCF7 cells are weakly invasive (Lacroix and Leclercq, 2004), thus the SNAI1-inducible MCF7 cell line was well suited to study alteration in expression of the selected genes in terms of their influence on invasion in conjunction with SNAI1 induction, knockdown or independently ( Figure 6). Potential artefacts associated with changes in cell growth or proliferation are controlled within the transwell assays used, because values reflect the ratio of signal from cells located at either side of the matrigel barrier.
Over-expression of IRX4 significantly increased invasion relative to controls in all conditions examined and IRX4 had high relative expression in a subset of basal-like breast cancers ( Figures 5, 6). IRX4 is a homeobox transcription factor involved in cardiogenesis, marking a ventricular-specific progenitor cell (Nelson et al., 2016) and is also associated with prostate cancer risk (Xu et al., 2014). SNX29 belongs to the sorting nexin protein family that function in endosomal sorting and signalling (Marat and Haucke, 2016). SNX29 is poorly characterised and ectopic expression significantly reduced invasion in a SNAI1-dependent manner ( Figure 6). Since we obtained these results, SNX29 downregulation was associated with metastasis and chemoresistance in ovarian carcinoma (Zhu et al., 2015), consistent with SNX29 inhibition of invasion driven by Snail. ATG3 is an E2-like enzyme required for autophagy and mitochondrial homeostasis (Doherty and Baehrecke, 2018), ATG3 overexpression significantly increased MCF7 invasion. Knockdown of ATG3 reduced invasion in hepatocellular carcinoma (Li et al., 2013). UNK is a RING finger protein homologous to unkempt which binds mRNA, functions in ubiquitination and was upregulated in gastrulation (Mohler et al., 1992). Others reported that UNK mRNA binding controls neuronal 24 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint morphology and can induce spindle-like cell shape in fibroblasts (Murn et al., 2015(Murn et al., , 2016. UNK significantly increased MCF7 invasion both independently of and additively with Snail; supporting a potential role in breast cancer progression. Indeed, UNK was overexpressed in cancers relative to controls in ArrayExpress (Parkinson et al., 2009). These in vitro confirmatory results both support the novel analysis approach and evidence new function for the genes examined.

25
. CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint

A Comprehensive D. melanogaster functional gene network (DroFN)
A high-confidence, comprehensive Drosophila melanogaster functional network (DroFN) was developed using a previously described inference approach (Overton et al., 2011). Functional interaction probabilities, corresponding to pathway co-membership, were estimated by logistic regression of Bayesian probabilities from STRING v8.0 scores (Jensen et al., 2009) and Gene Ontology (GO) coannotations (Ashburner et al., 2000), taking KEGG (Kanehisa et al., 2010) pathways as gold standard.
Gene pair co-annotations were derived from the GO database of March 25th 2010. The GO Biological Process (BP) and Cellular Component (CC) branches were read as a directed graph and genes added as leaf terms. The deepest term in the GO tree was selected for each gene pair, and BP was given precedence over CC. Training data were taken from KEGG v47, comprising 110 pathways (TRAIN-NET). Bayesian probabilities for STRING and GO coannotation frequencies were derived from TRAIN-NET (Overton et al., 2011). Selection of negative pairs from TRAIN-NET using the perl rand() function was used to generate training data with equal numbers of positive and negative pairs (TRAIN-BAL), which was input for logistic regression, to derive a model of gene pair functional interaction probability: Where: pGO is the Bayesian probability derived from Gene Ontology coannotation frequency pSTRING is the Bayesian probability derived from the STRING score frequency The above model was applied to TRAIN-NET and the resulting score distribution thresholded by seeking a value that maximised the F-measure (van Rijsbergen, 1979) and True Positive Rate

26
. CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint (TPR), while also minimising the False Positive Rate (FPR). The selected threshold value (p  (Yu et al., 2008) were assessed against TEST-NET (Table S1, Figure S1).

Network neighbourhood clustering (NetNC) algorithm
NetNC identifies functionally coherent nodes in a subgraph S of functional gene network G (an undirected graph), induced by some set of nodes of interest D; for example, candidate transcription factor target genes assigned from analysis of ChIP-seq data. Intuitively, we consider the proportion of common neighbours for nodes in S to define coherence; for example, nodes that share neighbours have greater coherence than nodes that do not share neighbours. The NetNC workflow is 27 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint summarised in Figure 1 and described in detail below. Two analysis modes are available a) nodecentric (parameter-free) and b) edge-centric, with two parameters. Both modes begin by assigning a p-value to each edge (S ij ) from Hypergeometric Mutual Clustering (HMC) (Goldberg and Roth, 2003), described in points one and two, below. 3. The NetNC edge-centric mode employs positive false discovery rate (Storey, 2002) and an iterative minimum cut procedure (Ford and Fulkerson, 1956) to derive clusters as follows: a) Subgraphs with the same number of nodes as S are resampled from G, application of steps 1 and 2 to these subgraphs generates an empirical null distribution of neighbourhood clustering p-values (H 0 ). This H 0 accounts for the effect of the sample size and the structure of G on the S ij hypergeometric p-values (p ij ). Each NetNC run on TF_ALL in this study resampled 1000 subgraphs to derive H 0 . b) Each edge in S is associated with a positive false discovery rate (q) estimated over p ij using H 1 and H 0 . The neighbourhood clustering subgraph C is induced by edges where the associated q ≤ Q. c) An iterative minimum cut procedure (Ford and Fulkerson, 1956) is applied to C until all components have density greater than or equal to a threshold Z. Edge weights in this procedure are taken as the negative log p-values from H 1 .
b) GMM is applied to identify structure in the NFCS distribution. Expectationmaximization fits a mixture of Gaussians to the distribution using independent mean and standard deviation parameters for each Gaussian (Dempster et al., 1977;Lubbock et al., 2013). Models with 1..9 Gaussians are fitted and the final model selected using the to the NFCS=0 nodes to produce NFCS_GN0. GMM is performed on NFCS_GN0 and nodes eliminated according to the above procedure on the resulting model. This procedure was developed following manual inspection of results on training data from 29 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint KEGG pathways with 'synthetic neutral target genes' (STNGs) as nodes resampled from G (TRAIN-CL, described in section 2.2.1).
Therefore, NetNC can be applied to predict functional coherence using either edge-centric or nodecentric analysis modes. The edge-centric mode automatically produces a network, whereas the node-centric analysis does not output edges; therefore to generate networks from predicted FBT nodes an edge pFDR threshold may be applied, pFDR≤0.1 was selected as the default value. The statistical approach to estimate pFDR and local FDR are described in the sections below.

Estimating positive false discovery rate for hypergeometric mutual clustering pvalues
The following procedure is employed to estimate positive False Discovery Rate (pFDR) (Storey, 2002) in the NetNC edge-centric mode. Subgraphs with number of nodes identical to S are resampled from G to derive a null distribution of HMC p-values (H 0 ) (section 4.2, above). The resampling approach for pFDR calculation in NetNC-FTI controls for the structure of the network G, including degree distribution, but does not control for the degree distribution or other network properties of the subgraph S induced by the input nodelist (D). In scale free and hierarchical networks, degree correlates with clustering coefficient; indeed, this property is typical of biological networks (Yamada and Bork, 2009). Part of the rationale for NetNC assumes that differences between the properties of G and S (for example; degree, clustering coefficient distributions) may enable identification of clusters within S. Therefore, it would be undesirable to control for the degree distribution of S during the resampling procedure for pFDR calculation because this would also partially control for clustering coefficient. Indeed clustering coefficient is a node-centric parameter that has similarity with the edge-centric Hypergeometric Clustering Coefficient (HMC) calculation (Goldberg and Roth, 2003) used in the NetNC algorithm to analyse S. Hence, the 30 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint resampling procedure does not model the degree distribution of S, although the degree distribution of G is controlled for. Positive false discovery rate is estimated over the p-values in H 1 (p ij ) according to Storey (Storey, 2002): Where: R denotes hypotheses (edges) taken as significant V are the number of false positive results (type I error) NetNC steps through threshold values (p α ) in p ij estimating V using edges in H 0 with p≤p α . H 0 represents Y resamples, therefore V is calculated at each step: The H 1 p-value distribution is assumed to include both true positives and false positives (FP); H 0 is taken to be representative of the FP present in H 1 . This approach has been successfully applied to peptide spectrum matching (Fitzgibbon et al., 2008;Sennels et al., 2009). The value of R is estimated by: Additionally, there is a requirement for monotonicity:

31
. CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint pFDR x+ 1 ≥ pFDR x , p x < p x +1 (6) Equation (6) represents a conservative procedure to prevent inconsistent scaling of pFDR due to sampling effects. For example consider the scaling of pFDR for pFDR x+1 at a p ij value with additional edges from H 1 but where no more resampled edges (i.e. from H 0 ) were observed in the interval between p x and p x+1 ; before application of equation (6), the value of pFDR x+1 would be lower than pFDR x . The approach also requires setting a maximum on estimated pFDR, considering that there may be values of p α where R is less than V. We set the maximum to 1, which would correspond to a prediction that all edges at p ij are FPs. The assumption that H 1 includes false positives is expected to hold in the context of candidate transcription factor target genes and also generally across biomedical data due to the stochastic nature of biological systems (Marusyk et al., 2012;Raj and van Oudenaarden, 2008;Raj et al., 2010). We note that an alternative method to calculate R using both H 1 and H 0 would be less conservative than the approach presented here.

Estimating local false discovery rate from global false discovery rate
We developed an approach to estimate local false discovery rate (lcFDR) (Efron et al., 2001), being the probability that an object at a threshold (p α ) is a false positive (FP). Our approach takes global pFDR values as basis for lcFDR estimation. In the context of NetNC analysis using the DroFN network, a FP is defined as a gene (node) without a pathway comembership relationship to any other nodes in the nodelist D. The most significant pFDR value (pFDR min ) from NetNC was determined for each node S i across the edge set S ij . Therefore, pFDR min is the pFDR value at which node S i would be included in a thresholded network. We formulated lcFDR for the nodes with pFDR min meeting a given p α (k) as follows: Where l denotes the pFDR min closest to and smaller than k, and where at least one node has pFDR min ≡pFDR l . Therefore, our approach can be conceptualised as operating on ordered pFDR min values. n indicates the nodes in D with pFDR min values meeting threshold k. X represents the number of nodes at p α ≡k. The number of FPs for nodes with p α ≡k (FP k ) is estimated by subtracting the FP for threshold l from the FP at threshold k. Thus, division of FP k by X gives local false discovery rate bounded by k and l ( Figure S5). If we define the difference between pFDR k and pFDR l : Substituting pFDR k for (pFDR l + pFDR Δ ) into equation (7) and then simplifying gives: Equations (7) and (9) do not apply to the node(s) in D at the smallest possible value of pFDR min because pFDR l would be undefined; instead, the value of lcFDR k is calculated as the (global) pFDR min value. Indeed, global FDR and local FDR are equivalent when H 1 consists of objects at a single pFDR min value. Taking the mean lcFDR k across D provided an estimate of neutral binding in the studied ChIP-chip, ChIP-seq datasets and was calibrated against mean lcFDR values from datasets that had a known proportion of Synthetic Neutral Target Genes (SNTGs). Estimation of the total proportion of neutral binding in ChIP-chip or ChIP-seq data required lcFDR rather than (global) pFDR and, for example, accounts for the shape of the H 1 distribution. In the context of NetNC analysis of TF_ALL, mean lcFDR may be interpreted as the probability that any candidate target gene is neutrally bound in the dataset analysed; therefore providing estimation of the total neutral binding proportion. Computer code for calculation of lcFDR is provided within the NetNC 33 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint distribution. Estimates of SNTGs by the NetNC-FBT approach were not taken forward due to large 95% CI values ( Figure S8).

Functional Target Identification and local False Discovery Rate
Candidate target genes that did not pass NetNC-FTI thresholds were labelled neutrally bound (FTI_NB). The proportion of FTI_NB genes was compared to the proportion of neutral binding estimated by lcFDR (lcFDR_NB, Figure 3A). The modulus of the difference between FTI_NB and lcFDR_NB for each dataset gave a distribution of differences in neutral binding proportion and the median of this distribution was quoted in the text above.

NetNC benchmarking and parameter optimisation
Gold standard data for NetNC benchmarking and parameterisation were taken as pathways from KEGG (v62, downloaded 13/6/12) (Kanehisa et al., 2010). Training data were selected as seven pathways (TRAIN-CL, 184 genes) and a further eight pathways were selected as a blind test dataset (TEST-CL, 186 genes) summarised in Table S7. For both TRAIN-CL and TEST-CL, pathways were selected to be disjoint and to cover a range of different biological functions. However, pathways with shared biology were present within each group; for example TRAIN-CL included the pathways dme04330 'Notch signaling' and dme04914 'Progesterone-mediated oocyte maturation', which are related by notch involvement in oogenesis (López-Schier and St Johnston, 2001;Schmitt and Nebreda, 2002). TEST-CL also included the related pathways dme04745 'Phototransduction' and dme00600 'Sphingolipid metabolism', for example where ceramide kinase regulates photoreceptor homeostasis (Acharya et al., 2003;Dasgupta et al., 2009;Yonamine et al., 2011).
Gold standard datasets were also developed in order to investigate the effect of dataset size and noise on NetNC performance. The inclusion of noise as resampled network nodes into the gold-

34
. CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint standard data was taken to model neutral TF binding (Li et al., 2008;Shlyueva et al., 2014) and matches expectations on data taken from biological systems in general (Marusyk et al., 2012;Raj and van Oudenaarden, 2008). Therefore, gold standard datasets were generated by combining TRAIN-CL with nodes resampled from the network (G) and combining these with TRAIN-CL. The final proportion of resampled nodes (Synthetic Neutral Target Genes, SNTGs) ranged from 5% through to 80% in 5% increments. Since we expected variability in the network proximity of SNTGs to pathway nodes (S), 100 resampled datasets were generated per %SNTG increment.
Further gold-standard datasets were generated by taking five subsets of TRAIN-CL, from three through seven pathways. Resampling was applied for these datasets as described above to generate node lists representing five pathway sets in TRAIN-CL by sixteen %SNTG levels by l00 repeats (TRAIN_CL_ALL, 8000 node lists). A similar procedure was applied to TEST-CL, taking from three through eight pathways to generate data representing six pathway subsets by sixteen noise classes. FTI was a binary classification task for discrimination of pathway nodes from noise, therefore all pathway nodes were taken as as positives and SNTGs were negatives for the FTI MCC calculation. The FTI approach therefore tests discrimination of pathway nodes from SNTGs, which 35 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint is particularly relevant to identification of functionally coherent candidate TF targets from ChIPchip or ChIP-seq peaks.
Parameter selection for NetNC on the FTI task analysed MCC values for the 100 SNTG resamples across five pathway subsets by sixteen SNTG levels in TRAIN-CL_ALL over the Q, Z values examined, respectively ranging from up to 10 -7 to 0.8 and from up to 0.05 to 0.9. Data used for optimisation of NetNC parameters (Q, Z) are available from the Biomodels database and contour plots showing mean MCC across Q, Z values per %SNTG are provided in Figure S7 Table S8. NetNC parameters were also determined for analysis without any prior belief about the %SNTG in the input data -and therefore generalise across a wide range of %SNTG and dataset sizes. For this purpose, a contour plot was produced to represent the proportion of datasets where NetNC performed better than 75% of the maximum performance across TRAIN-CL_ALL for the 36 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint FTI task in the Q, Z parameter space. The maximum circle approach described above was applied to the contour plot in order to derive 'robust' parameter values (Q, Z), which were respectively 0.120, 0.306 (NetNC-FTI).

Performance on blind test data
We compared NetNC against leading methods, HC-PIN (Wang et al., 2011) andMCL (Enright et al., 2002) on blind test data (Figure 2, Table S2). Previous work that evaluated nine clustering algorithms, including MCL, found that HC-PIN had strong performance in functional module identification and was robust against false positives (Wang et al., 2011); therefore HC-PIN was selected for extensive comparison against NetNC. Input, output and performance summary files for HC-PIN on TEST-CL are available from the Biomodels database (per datapoint, n=100 for NetNC, n=99 for HC-PIN). HC-PIN was run on the weighted graphs induced in DroFN by TEST-CL with default parameters (lambda = 1.0, threshold size = 3). MCL clusters in DroFN significantly enriched for query nodes from TEST-CL-SR were identified by resampling to generate a null distribution (Overton et al., 2011). Clusters with q<0.05 were taken as significant. MCL performance was optimised for the Functional Target Identification (FTI) task over the TRAIN-CL-SR datasets for MCL inflation values from 2 to 5 incrementing by 0.2. The best-performing MCL inflation value overall was 3.6 ( Table S9).

Subsampling of transcription factor binding datasets and statistical testing
Robustness of NetNC performance was studied by taking 95%, 80% and 50% resamples from nine public transcription factor binding datasets, summarised in section 4.3 and described previously in detail (MacArthur et al., 2009;Ozdemir et al., 2011;Roy et al., 2010;Sandmann et al., 2007;Zeitlinger et al., 2007). A hundred subsamples of each of these datasets were taken at rates of 95%, 80% and 50%, thereby producing a total of 2700 datasets (TF_SAMPL). NetNC-FTI results across 37 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint TF_SAMPL were used as input for calculation of median and 95% confidence intervals for the edge and gene overlap per subsampling rate for each transcription factor dataset analysed. The NetNC resampling parameter (Y) was set at 100, the default value. The edge overlap was calculated as the proportion of edges returned by NetNC-FTI for the subsampled dataset that were also present in NetNC-FTI results for the full dataset (i.e. at 100%). Therefore, nine values for median overlap and 95% CI were produced per subsampling rate for both edge and gene overlap, corresponding to the nine transcription factor binding datasets (Table S4). The average (median) value of these nine median overlap values, and of the 95% CI, was calculated per subsampling rate; these average values are quoted in Supplemental Material.
False discovery rate (FDR) correction of p-values was applied where appropriate and is indicated in this manuscript by the commonly used notation 'q' Benjamini-Hochberg correction was applied (Benjamini and Hochberg, 1995) unless otherwise specified in the text. Calculation of pFDR and local FDR values by NetNC is described in the sections above.

Transcription factor binding and Notch modifier datasets
We analysed public Chromatin Immunoprecipitation (ChIP) data for the transcription factors twist and snail in early Drosophila melanogaster embryos. These datasets were derived using ChIP followed by microarray (ChIP-chip) (MacArthur et al., 2009;Sandmann et al., 2007;Zeitlinger et al., 2007) and ChIP followed by solexa pyrosequencing (ChIP-seq) (Ozdemir et al., 2011).
Additionally 'highly occupied target' regions, reflecting multiple and complex transcription factor occupancy profiles, were obtained from ModEncode (Roy et al., 2010). Nine datasets were analysed in total (TF_ALL) and are summarised below.
The 'union' datasets (WT embryos 2-3h, mostly late stage four or early stage five) combined ChIP-chip peaks significant at 1% FDR for two different antibodies targeted at the same TF and 38 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint these were assigned to the closest transcribed gene according to RNA Polymerase II binding data (MacArthur et al., 2009). Additionally, where the closest transcribed gene was absent from the DroFN network then the nearest gene was included if it was contained in DroFN. This approach generated the datasets sna_2-3h_union (1158 genes) and twi_2-3h_union (1848 genes). The union of peaks derived from two separate antibodies maximised sensitivity and may have reduced potential false negatives arising from epitope steric occlusion. For the 'Toll 10b ' datasets, significant peaks with at least two-fold enrichment for Twist or Snail binding were taken from ChIP-chip data on Toll 10b mutant embryos (2-4h), which had constitutively activated Toll receptor (Stathopoulos et al., 2002;Zeitlinger et al., 2007); mapping to DroFN generated the datasets twi_2-4h_Toll 10b (1238 genes), sna_2-4h_Toll 10b (1488 genes). Toll 10b embryos had high expression of Snail and Twist, which drove all cells to mesodermal fate trajectories (Zeitlinger et al., 2007). The two-fold enrichment threshold selected for this study reflects 'weak' binding, although was expected to include functional TF targets (Biggin, 2011). Therefore the candidate target genes for twi_2-4h_Toll 10b and sna_2-4h_Toll 10b were expected to contain a significant proportion of false positives.
The Highly Occupied Target dataset included 38562 regions, of which 1855 had complexity score ≥8 and had been mapped to 1648 FlyBase genes according to the nearest transcription start site (Roy et al., 2010); 677 of these genes were matched to a DroFN node (HOT). The 'HighConf' data took Twist ChIP-seq binding peaks in WT embryos (1-3h) that had been reported to be 'high confidence' assignments; high confidence filtering was based on overlap with ChIP-chip regions, identification by two peak-calling algorithms and calibration against peak intensities for known Twist targets, corresponding to 832 genes (Ozdemir et al., 2011). A total of 664 of these genes were found in DroFN (twi_1-3h_hiConf) and represented the most stringent approach to peak calling of all the nine TF_ALL datasets. The intersection of ChIP-chip binding for two different Twist antibodies in WT embryos spanning two time periods (2-4h and 4-6h) identified a total of 1842 target genes (Sandmann et al., 2007) of which 1444 mapped to DroFN (Intersect_ALL). Subsets of 39 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint Intersect_ALL identified regions bound only at 2-4 hours (twi_2-4h_intersect, 801 genes), or only at 4-6 hours (twi_4-6h_intersect, 818 genes), or 'continuously bound' regions identified at both 2-4 and 4-6 hours (twi_2-6h_intersect, 615 genes). Assigned gene targets may belong to more than one subset of Intersect_ALL because time-restricted binding was assessed for putative enhancer regions prior to gene mapping; overlap of the Intersect_ALL subsets ranged between 30.2% and 55.4%. The Intersect_ALL datasets therefore enabled assessment of functional enhancer binding according to occupancy at differing time intervals and also to examine the effect of intersecting ChIPs for two different antibodies upon the proportion of predicted functional targets recovered.

40
. CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint The Notch signalling modifiers analysed in this study were selected based on identification in at least two of the screens reported in (Guruharsha et al., 2012).

Breast cancer transcriptome datasets and molecular subtypes
Primary breast tumour gene expression data were downloaded from NCBI GEO (GSE12276, GSE21653, GSE3744, GSE5460, GSE2109, GSE1561, GSE17907, GSE2990, GSE7390, GSE11121, GSE16716, GSE2034, GSE1456, GSE6532, GSE3494, GSE68892 (formerly geral-00143 from caBIG)). All datasets were Affymetrix U133A/plus 2 chips and were summarised with Ensembl alternative CDF (Dai et al., 2005). RMA normalisation (Irizarry et al., 2003) and ComBat batch correction (Johnson et al., 2007) were applied to remove dataset-specific bias as previously described (Moleirinho et al., 2013;Sims et al., 2008). Intrinsic molecular subtypes were assigned based upon the highest correlation to Sorlie centroids (Sørlie et al., 2003), applied to each dataset separately. Centred average linkage clustering was performed using the Cluster and TreeView programs (Eisen et al., 1998). Centroids were calculated for each gene based upon the mean expression across each of the Sorlie intrinsic subtypes (Sørlie et al., 2003). These expression values were squared to consider up and down regulated genes in a single analysis. Orthology to the DroFN network was defined using Inparanoid (Östlund et al., 2009). Differential expression was calculated by t-test comparing normalised (unsquared) expression values in normal-like and basallike tumours with false discovery rate correction (Benjamini and Hochberg, 1995).

MCF-7
Tet-On cells were purchased from Clontech and maintained as previously described (Liu et al., 2013).To analyse the ability of transfected MCF7 breast cancer cells to degrade and invade surrounding extracellular matrix, we performed an invasion assay using the CytoSelect™ 24-Well Cell Adhesion Assay kit. This transwell invasion assay allow the cells to invade through a 41 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint matrigel barrier utilising basement membrane-coated inserts according to the manufacturer's protocol. Briefly, MCF7 cells transfected with the constructs (Doxycycline-inducible SNAI1 cDNA or SNAI1 shRNA with or without candidate gene cDNA) were suspended in serum-free medium.
SNAI1 cDNA or SNAI1 shRNA were cloned in our doxycyline-inducible pGoldiLox plasmid (pGoldilox-Tet-ON for cDNA and pGolidlox-tTS for shRNA expression) using validated shRNAs against SNAI1 (NM_005985 at position 150 of the transcript (Liu et al., 2013)). pGoldilox has been used previously to induce and knock down the expression of Ets genes (Peluso et al., 2017).
Following overnight incubation, the cells were seeded at 3.0×10 5 cells/well in the upper chamber and incubated with medium containing serum with or without doxycyline in the lower chamber for 48 hours. Concurrently, 10 6 cells were treated in the same manner and grown in a six well plate to confirm over-expression and knockdown. mRNA was extracted from these cells and quantitative real-time PCR (RT-qPCR) was performed as previously described (Essafi et al., 2011); please see Supplemental File 2 for gene primers. The transwell invasion assay evaluated the ratio of CyQuant dye signal at 480/520 nm in a plate reader of cells from the two wells and therefore controlled for potential proliferation effects associated with ectopic expression. We used empty vector (mCherry) and scrambled shRNA as controls and to control for the non-specific signal. At least three experimental replicates were performed for each reading.   The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint heatmap, colours correspond to: HER2-overexpressing (purple), basal-like (red), normal-like (green), luminal A (dark blue) and luminal B (light blue). Clustering based on sna_2-3h_union shows better separation of clusters than twi_2-3h_union.
Breast cancer subtypes are indicated on the x-axis of each plot as follows: B (basal-like), E (HER2overexpressing), LA (luminal A), LB (luminal B), NL (normal-like). Similar patterns were observed between these two datasets, with strongest enrichment for basal-like and normal-like centroids; sna_2-3h_union also showed relatively stronger enrichment for the luminal A centroid compared with results for twi_2-3h_union. Absolute differences in p between sna_2-3h_union and twi_2-3h_union may reflect the difference in dataset sizes (n=256 vs n=467), indeed stronger p-values are typically associated with comparisions across larger datasets.
Page s8 of s24       The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint with Delta (Dl). O-fut1 also has a role in modulating the interaction.
Rumi acts in the endoplasmic reticulum to glucosylate the EGF modules in N, this is required for correct folding and cleavage of N. FBgn0011225 jaguar (7/9) Myosin is a protein that binds to actin and has ATPase activity that is activated by actin. Together CLIP-190 and jar may coordinate the interaction between the actin and microtubule cytoskeleton. May link endocytic vesicles to microtubules and may be involved in transport in the early embryo and in the dynamic process of dorsal closure. It is believed that its function changes during the life cycle. FBgn0003254 ribbon (7/9) *Ribbon is a nuclear BTB-domain protein, expressed in most embryonic cells. It is required for development of the salivary gland and trachea, as well as for dorsal closure. It regulates both growth and differentiation of salivary gland cells.
FBgn0000014 abdominal A (7/9) Sequence-specific transcription factor which is part of a developmental regulatory system that provides cells with specific Page s18 of s24 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint positional identities on the anterior-posterior axis. Required for segmental identity of the second through eighth abdominal segments.
Once a pattern of abd-A expression is turned on in a given parasegment, it remains on the more posterior parasegment, so that the complex pattern of expression is built up in the successive parasegments. Appears to repress expression of Ubx whenever they appear in the same cell, but abd-A is repressed by Abd-B only in the eight and ninth abdominal segments.
FBgn0003900 twist (7/9) Involved in the establishment and dorsoventral patterning of germ layers in the embryo.

FBgn0003177
Current ID: FBgn0262614 polychaetoid (7/9) *Polychaetoid is a broadly acting protein that is associated with multiple proteins at the surface and within the cytoskeleton, connecting events between the two.
Page s20 of s24 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the  The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/455709 doi: bioRxiv preprint