LSTrAP-Cloud: A User-friendly Cloud Computing Pipeline to Infer Co-functional and Regulatory Networks

As more genomes become available, gene function prediction has become one of the major hurdles in extracting meaningful information about the biological processes that genes participate in. To facilitate gene function prediction, we show how our user-friendly pipeline, Large-Scale Transcriptomic Analysis Pipeline in Cloud (LSTrAP-Cloud), can help biologists shortlist genes of interest. LSTrAP-Cloud is based on Google Colaboratory and provides user-friendly tools that process and quality-control RNA sequencing data streamed from the European Nucleotide Archive. LSTrAP-Cloud outputs a gene co-expression network that can be used to identify functionally related genes for any organism with a sequenced genome and publicly available RNA sequencing data. Here, we used the nicotine biosynthesis pathway of Nicotiana tabacum as a case study to demonstrate how enzymes, transporters and transcription factors involved in the synthesis, transport and regulation of nicotine can be identified using our pipeline.

the start of 2010 and 2020, respectively. Analysing this data would have been unthinkable a decade ago, due to limitations in software used to estimate gene expression from RNA-seq data. However, drastic improvements in this software, exemplified by kallisto [22] and Salmon [23], have made this task possible within reasonable time on a typical desktop or even a Raspberry Pi-like miniature computer [24]. Furthermore, user-friendly pipelines such as UTAP [25], CURSE [26], LSTrAP-Lite [24] and LSTrAP [27] are publicly available and are an invaluable resource for both experts and non-bioinformaticians. However, all these resources typically require complex installation or a Linux environment.

The introduction of cloud computing has provided alternatives to how data can be managed,

successfully. Seven files were not found and four files had unacceptable download speed among the 89 files that were not processed successfully (Table S2).

Nitab-v4.5_cDNA_Edwards2017.fasta CDS (Table S3). The co-expression neighbourhood of the gene of interest is also displayed at the end of the Colab notebook.
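As an illustration of what a co-expression neighbourhood is, the following minimal sketch ranks genes by how closely their expression profiles follow that of a query gene. It assumes Pearson correlation as the similarity measure, which this section does not specify. The function names, the `top_n` parameter and the toy expression values are hypothetical and are not taken from the LSTrAP-Cloud notebook.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A gene with constant expression carries no co-expression signal.
        return 0.0
    return cov / sqrt(var_x * var_y)


def coexpression_neighbourhood(expression, query, top_n=50):
    """Return the top_n genes most correlated with the query gene.

    expression maps a gene identifier to its expression values across samples.
    """
    query_profile = expression[query]
    scores = [
        (gene, pearson(query_profile, profile))
        for gene, profile in expression.items()
        if gene != query
    ]
    # Highest correlation first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Toy data (hypothetical): four genes measured in four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
neighbours = coexpression_neighbourhood(expression, "geneA", top_n=2)
```

In this toy example, `geneB` and `geneD` are returned as the two closest neighbours of `geneA`, while the anti-correlated `geneC` is excluded.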

To generate Figure 3, the co-expression network of Nitab4.5_0000884g0010.1 was downloaded from Colab as a JSON file and modified in Cytoscape desktop v3.7.1 (Table S4). For brevity, only transporters, transcription factors and genes involved in nicotine biosynthesis are shown, but the network containing all 50 genes is available (Figure S1).

of these genes in the tobacco genome version Nitab-v4.5_cDNA_Edwards2017.fasta were identified through BLAST v2.6.0+ against the N. tabacum CDS. The function of the genes found in the network was further annotated using results from BLAST and Mercator (Tables S3 and S5).

(Table S1). Only wild-type and untreated experiments indicating leaf, flower, root, shoot and stem were selected for the analysis (Table S6). The median expression value of a gene in an organ was normalised by the highest median expression value of the gene across all organs.

which is needed to run the notebook. We provide a user's manual (Document S1) and SRA experiment list (
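The organ-level normalisation described in this section (median expression per organ, divided by the highest organ median of the same gene) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `organ_profile`, the input layout and the expression values are hypothetical, and the handling of a gene with zero expression everywhere is a choice made here, not one described in the text.

```python
from statistics import median


def organ_profile(samples_by_organ):
    """Normalise a gene's per-organ median expression to the range 0..1.

    samples_by_organ maps an organ name to the gene's expression values
    in all samples annotated with that organ.
    """
    medians = {
        organ: median(values) for organ, values in samples_by_organ.items()
    }
    highest = max(medians.values())
    if highest == 0:
        # Assumption: a gene not expressed in any organ gets an all-zero profile.
        return {organ: 0.0 for organ in medians}
    return {organ: value / highest for organ, value in medians.items()}


# Hypothetical expression values for one gene in three organs.
expression_values = {
    "root": [80.0, 100.0, 120.0],
    "leaf": [5.0, 10.0, 15.0],
    "flower": [20.0, 25.0, 30.0],
}
profile = organ_profile(expression_values)
```

With these values, the organ with the highest median (root) receives 1.0 and the other organs are expressed as fractions of it, which makes profiles of strongly and weakly expressed genes directly comparable.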

As we sit on an expanding trove of data today, there is an immense amount of knowledge to be uncovered with the improvement in gene annotation and characterisation. Classical genomic approaches have allowed us to rapidly annotate genomes in silico based on sequence similarity to existing sequences. This approach has its limitations but can be greatly improved when the spatial and temporal expression of genes is taken into account.

In this study, we leveraged the benefits of cloud computing and the user-friendliness of Jupyter notebooks to implement a large-scale transcriptomic analysis pipeline, LSTrAP-Cloud (Figure 1). Using N. tabacum as an example (Figure 2), we showed that co-expression networks identified not only the enzymes involved in the metabolism of nicotine but also regulators and transporters that are found upstream and downstream of nicotine biosynthesis (Figures 3 and 4). The example of nicotine biosynthesis demonstrates that co-expression network analysis is a valuable addition to sequence similarity-based approaches, as it can infer modules of functionally related genes.

While the field of bioinformatics is advancing rapidly, it is important that biologists are also empowered with the tools and predictions available to bioinformaticians, as this can greatly shorten the time required for gene characterisation through the identification of potential targets.

The future of gene function prediction, however, will require a new generation of biologists equipped

The following abbreviations are used in this manuscript: