MyBrain-Seq: A Pipeline for MiRNA-Seq Data Analysis in Neuropsychiatric Disorders

High-throughput sequencing of small RNA molecules such as microRNAs (miRNAs) has become a widely used approach for studying gene expression and regulation. However, analyzing miRNA-Seq data can be challenging because it requires multiple steps, from quality control and preprocessing to differential expression and pathway-enrichment analyses, with many tools and databases available for each step. Furthermore, reproducibility of the analysis pipeline is crucial to ensure that the results are accurate and reliable. Here, we present myBrain-Seq, a comprehensive and reproducible pipeline for analyzing miRNA-Seq data that incorporates miRNA-specific solutions at each step of the analysis. The pipeline was designed to be flexible and user-friendly, allowing researchers with different levels of expertise to perform the analysis in a standardized and reproducible manner, using the most common and widely used tools for each step. In this work, we describe the implementation of myBrain-Seq and demonstrate its capacity to consistently and reproducibly identify differentially expressed miRNAs and enriched pathways by applying it to a real case study in which we compared schizophrenia patients who responded to medication with treatment-resistant schizophrenia patients to obtain a 16-miRNA treatment-resistant schizophrenia profile.


Introduction
The rapid development in the field of transcriptomics has allowed the existing knowledge about the molecular mechanisms of pathogenesis to be expanded. Over the last two decades, RNA-Seq technology has been used in translational medicine as a valuable tool for disease profiling and biomarker identification. This has led to important discoveries [1][2][3] and the foundation of precious resources such as ENCODE [4], the Cancer Genome Atlas [5] or the pathway database Reactome [6]. RNA-Seq technology has also allowed a new approach to study some of the most heterogeneous disorders, such as many neuropsychiatric conditions. The study of gene-expression regulators such as microRNAs (miRNAs) has offered a new angle from which neuropsychiatric conditions could be understood: the environmental influence on gene expression and the different responses to that environment. This application of the RNA-Seq technology is commonly known as miRNA-Seq.

Quality Control and Adapter Removal
The first two preprocessing steps ( Figure 1, step 1, a & b) are to discover samples with low-quality sequences as well as to detect the presence of foreign DNA contamination such as sequencing adapters. An evaluation of the results of this module is important for discarding samples from the downstream analysis and therefore it is the first step to be performed.
On one hand, myBrain-Seq includes the analysis of sequence quality scores, sequence length distribution, overrepresented sequences, adapter content and other parameters included in the FastQC tool [41]. On the other hand, the adapter trimming is performed using the Cutadapt tool [42]. Both quality control and adapter trimming analyze several samples at once, thus accelerating the analysis of large sample sizes.

Alignment to the Reference Genome
Alignment is the process of mapping reads against a biological reference (usually a genome or transcriptome) in order to assign genomic positions to each of the sequences

Quality Control and Adapter Removal
The first two preprocessing steps ( Figure 1, step 1, a & b) are to discover samples with low-quality sequences as well as to detect the presence of foreign DNA contamination such as sequencing adapters. An evaluation of the results of this module is important for discarding samples from the downstream analysis and therefore it is the first step to be performed.
On one hand, myBrain-Seq includes the analysis of sequence quality scores, sequence length distribution, overrepresented sequences, adapter content and other parameters included in the FastQC tool [41]. On the other hand, the adapter trimming is performed using the Cutadapt tool [42]. Both quality control and adapter trimming analyze several samples at once, thus accelerating the analysis of large sample sizes.

Alignment to the Reference Genome
Alignment is the process of mapping reads against a biological reference (usually a genome or transcriptome) in order to assign genomic positions to each of the sequences of the raw data. The alignment in myBrain-Seq ( Figure 1, step 1, c-e) encompasses building of a genome index (if needed), the alignment of the reads to a reference genome, a file format conversion of the results and a quality control of the alignments. First, myBrain-Seq performs the alignments with the Bowtie 1 tool [43], a short sequence aligner specializing in mapping short transcripts (such as miRNAs) to large genomes [7]. Bowtie uses a Burrows-Wheeler index of the genome to speed the mapping process and reduce the impact on computer memory. This small memory footprint is used by myBrain-Seq to parallelize the alignments of several samples to accelerate the processing of large batches of files. Next, a file format conversion from Sequence Alignment Map (SAM) to Binary Alignment Map (BAM) is performed with SAMtools [44]. Finally, a quality control with SAMtools stats and SAMtools bcftools [45] brings information about the sequencing depth of the samples, an important parameter that allows an estimate of the reliability of the RNA-Seq data. Other parameters such as ACGT cycles, sequencing coverage and GC content are also reported.

Transcripts Annotation and Quantification
Aligned sequences in BAM are then mapped to a reference transcriptome. This process, usually known as "annotation", uses the genomic coordinates of an annotation file to convert the genomic locations of the aligned samples into transcript IDs. Those transcripts are then grouped by ID and quantified in a process known as read summarization or quantification. MyBrain-Seq performs both processes, annotation and quantification, using the software FeatureCounts [46] (Figure 1, step 1, f). It also requires a user-uploaded GTF/GFF file as the biological reference for annotations. Results of the quantification are presented as plain transcript counts.

Differential Expression Analysis
A differential expression analysis (DEA) is the process of finding statistically significant differences in the transcript expression between two different conditions. Those differences can be potentially related to biological alterations and usually are the starting point of the functional analysis and the interpretation of results. MyBrain-Seq uses two well-known software packages for DEA, namely DESeq2 [47] and EdgeR [48] (Figure 1, step 2, a & b), and implements a volcano plot for the visualization of the differentially expressed miRNAs (DE miRNAs). Additionally, myBrain-Seq implements an option to integrate the results of both pieces of software ("integrated results" henceforth), thus offering a more conservative analysis of the data ( Figure 1, step 2, c).
Regarding the software for differential expression analysis, DESeq2 normalizes the counts using the median of ratios method [49] prior to the DEA using a negative binomial distribution. On the other hand, EdgeR uses a weighted trimmed mean of the log expression ratios between the samples for the normalization (TMM method) [50] and an exact test for calculating the statistical differences in the miRNA expression. Both software packages apply the Benjamini-Hochberg FDR correction to the p-values [51]. MyBrain-Seq is also able to adjust the DEA model for covariates using the user input (more details in the Results section).
The integration of the DESeq2 and EdgeR results is performed by finding their common miRNAs; then, for each of those miRNAs, FDR and p-values are averaged. Finally, FDR and log 2 FC thresholds are set (default FDR < 0.05; |log 2 FC| ≥ 0.5) to obtain the list of DE miRNAs. The user can manually adjust both of these thresholds as well as the ones used for subsequent tasks (refer to myBrain-Seq documentation). Additionally, a Venn diagram is created to offer a visual representation of the coincidences: first DESeq2 and EdgeR results are filtered by FDR (default FDR < 0.05; |log 2 FC| ≥ 0.5), then coincidences and differences are counted and plotted using the R package "VennDiagram" [52]. Volcano plots of all the results are created using the R package "EnhancedVolcano" [53].

Hierarchical Clustering
In the hierarchical clustering step, samples are assigned into clusters using the expression of the DE miRNAs. Samples with similar expression levels will group closely. This grouping can ease the identification of unknown relationships between the data as well as being an estimation of the classificatory power of the DE miRNAs. MyBrain-Seq automatically performs a hierarchical clustering analysis after the DEA using the R package "hclust" [54] (Figure 1, step 3). The whole process is divided into two steps: first, the data are prepared for the hierarchical clustering; second, the hierarchical clustering and figures are generated. The first step, preparation of the data, comprises the optional normalization of all the samples with DESeq2 [47] (flag -deseqNormalizationHclust) and the filtering of the DESeq2 and/or EdgeR results by FDR and log 2 FC (default FDR ≤ 0.05; |log 2 FC| ≥ 0.5) to obtain the DE miRNAs. After that, a table is built with one column per sample, each column being the DE miRNAs counts for a specific sample. In the second step, the counts are scaled using the R "scale" function, the matrix of Euclidean distances between samples is created and the clustering is performed using the "ward.D2" method of the R package "hclust" [54]. Finally, a dendrogram and a heatmap are generated using the R packages "dendextend" [55] and the function "heatmap.2" of the package gplots [56], respectively.

Functional Analysis
A functional analysis aims to put the differences spotted in the DEA into a biological context, suggesting biological pathways that might be useful for the investigator to analyze. The functional enrichment analysis of myBrain-Seq ( Figure 1, step 4) uses two annotation sources which are included in the myBrain-Seq Docker image: the miRNA-gene annotations from the Diana TarBase [30] and the gene-pathway annotations of the Reactome databases [6]. We selected Diana TarBase as our target annotation database because it contains only miRNA-target interactions that have been experimentally validated. Additionally, both Diana TarBase and Reactome are publicly accessible and updated on a regular basis. Regarding the enrichment analysis, myBrain-Seq follows the strategy proposed by Godard and van Eyll [21] to be specific for miRNA data, briefly: (i) using TarBase annotations, protein coding genes in Reactome pathways are converted into lists of miRNAs that target at least one of these genes; (ii) enrichment analysis is performed by comparing DE miRNAs of DESeq2, EdgeR or integrated results to the lists of miRNAs previously associated with the different pathways. Enrichment scores are calculated using a Fisher hypergeometric test. Finally, a word analysis on the enriched terms is performed using the R package "tidytext" [57] and the results are summarized in a figure and presented along with the pathway-enriched table.
Conventional enrichment analysis is performed on genes rather than miRNAs. In a miRNA-Seq analysis, genes are indirectly selected by target prediction; therefore, genes with more targets have more chances to be selected. The consequence of this bias is a non-specific result, usually identifying as enriched the most studied pathways such as cancer-related or generic signaling pathways [21]. MyBrain-Seq enrichment analysis deals with this bias as each miRNA is only represented once in each pathway, thus ensuring the specificity of the results.

MiRNA-Protein Interaction Network
After the functional analysis, the most enriched pathway is used to build a network of miRNA-protein interactions, providing the researcher with a possible molecular context for the observed differences in miRNA expression. The miRNA-protein interaction network step of myBrain-Seq ( Figure 1, step 5) uses the same annotation files as in the functional enrichment analysis step plus a protein-protein interaction file from the Reactome database [6]. The network is built by expanding the miRNA-protein interactions present in the most enriched pathway with the Reactome protein-protein interactions. The process is as follows: (i) miRNAs and genes that participate in the most enriched pathway are found using the functional analysis result; (ii) miRNA-protein interactions are found by using the Tarbase annotations file [30]; (iii) protein-protein interactions that participate in the most enrichment pathways are found using the protein-protein interaction file; and (iv) miRNA-protein interactions and protein-protein interactions are merged into a single table. Finally, a table with all the interactions is generated. This table can be easily imported into network analysis software such as Cytoscape [58] for further analysis and expansions. Additionally, an interactive network file in HTML is generated using the R packages "networkD3" [59] and "htmlwidgets" [60].

Summarization of the Quality Controls
Results of the quality control of the samples and the alignment are calculated on a per-sample basis, making overall interpretation of data quality difficult. To avoid this, the last step of myBrain-Seq analysis is the summarization of featureCounts [46], Samtools [45] and FastQC [41] results using MultiQC [61] (see Figure 1). The output of this step is a single HTML report from which different tables and graphs can be generated.

MyBrain-Seq Implementation
MyBrain-Seq is implemented as a Compi pipeline [39,40] and distributed as a Docker image that allows running it effortlessly. All external dependencies (Table 1) are satisfied using Docker images from the pegi3s Bioinformatics Docker images Project [62] (https://pegi3s. github.io/dockerfiles/). The source code of the pipeline is publicly available at GitHub (https://github.com/sing-group/my-brain-seq) under an MIT LICENSE, and the Docker image is available at Docker Hub (https://hub.docker.com/r/singgroup/my-brain-seq) and at Compi Hub (https://www.sing-group.org/compihub/explore/625e719acc1507001 943ab7f#overview). The application of myBrain-Seq to biomedical research is illustrated in a recent study [8] with the comparison of the miRNA profile of patients with schizophrenia (SZ) and treatment-resistant schizophrenia (TRS). The dataset comprises reads of circulating miRNA of 40 human patients with schizophrenia, of which 19 patients have a normal response to medication (MR; n = 19) and 21 have an insufficient response to medication (MNR; n = 21). The dataset can be downloaded from Gene Expression Omnibus (GEO) under the accession number GSE223043. We also provide a script in Supplementary Material Script S1 to automatically download and perform myBrain-Seq analysis on this dataset.

Results and Discussion
The goal of myBrain-Seq is to offer a modular and highly customizable tool for miRNA-Seq analysis that allows performing replicable studies. It offers a straightforward analysis process that brings together the most common tools in the field embedded in a portable and customizable pipeline. Among myBrain-Seq's main contributions are the options designed to solve typical problems in the analysis of transcriptomic data such as the high variability of the transcripts abundance (covariate adjustment), bias in the pathway-enrichment analysis resulting from the indirect selection of genes (miRNA-oriented pathway analysis strategy), low replicability of the results (containerized processes, DEA integration) or the need to explore the results in a biological context (miRNA-protein interaction network). The following subsections describe the contributions of myBrain-Seq with more details.

MyBrain-Seq Execution
The sequence of steps needed to start a myBrain-Seq can be described as follows: 1.
Creation of the directory tree in the local file system, referred to as "working directory", shown in Figure 2. The working directory consists of a main directory with two subdirectories: "/input" and "/output". The input subdirectory is where the parameter files of myBrain-Seq should be placed; the output subdirectory will contain the results after myBrain-Seq execution. This working directory can be initialized using the utilities included in the myBrain-Seq Docker image. This initialization creates a "run.sh" file, used to run the pipeline and templates of the other files required by myBrain-Seq (those inside "/input"). A "README.txt" file is also created with the instructions to fill the template files and run the pipeline.
The sequence of steps needed to start a myBrain-Seq can be desc 1. Creation of the directory tree in the local file system, referred to tory", shown in Figure 2. The working directory consists of a two subdirectories: "/input" and "/output". The input subdirect rameter files of myBrain-Seq should be placed; the output subd the results after myBrain-Seq execution. This working directory ing the utilities included in the myBrain-Seq Docker image. This a "run.sh" file, used to run the pipeline and templates of the ot myBrain-Seq (those inside "/input"). A "README.txt" file is a instructions to fill the template files and run the pipeline. 2. The second step is the preparation of the data. In addition myBrain-Seq needs a reference genome (or Bowtie index) and a G annotations as biological references to perform the analysis. It is not mandatory, to put all these files inside subdirectories under less, if they are in other locations (e.g., a shared directory to s provided "run.sh" will take care of this and create the appropr bindings in a transparent manner for the user. 3. The third step is the configuration of the analysis. This comprises files, namely: "compi.parameters", "conditions_file.txt" and These files are usually placed into the "input" directory. a. The "compi.parameters" file contains the paths and pa the analysis, i.e.: path of the working directory, paths to F logical references, paths to "conditions_file.txt" and to "c the adapter sequence. For more information about the that can be added, refer to the myBrain-S (h ps://github.com/sing-group/my-brain-seq).

2.
The second step is the preparation of the data. In addition to the FASTQ files, myBrain-Seq needs a reference genome (or Bowtie index) and a GFF file with miRNA annotations as biological references to perform the analysis. It is recommended, but not mandatory, to put all these files inside subdirectories under "/input". Nevertheless, if they are in other locations (e.g., a shared directory to save disk space), the provided "run.sh" will take care of this and create the appropriate Docker volume bindings in a transparent manner for the user.

3.
The third step is the configuration of the analysis. This comprises the creation of three files, namely: "compi.parameters", "conditions_file.txt" and "contrast_file.txt". These files are usually placed into the "input" directory.
a. The "compi.parameters" file contains the paths and parameters needed for the analysis, i.e.: path of the working directory, paths to FASTQ files and biological references, paths to "conditions_file.txt" and to "contrast_file.txt" and the adapter sequence. For more information about the optional parameters that can be added, refer to the myBrain-Seq user manual (https://github.com/ sing-group/my-brain-seq). b.
The "conditions_file.txt" contains the metadata regarding names and conditions of each fastQ file. This file is used by myBrain-Seq to link each sample with a condition and its covariates. Each row of this file contains the name of the FASTQ file, its condition, a user label for that sample and zero or more columns describing the covariates for that sample (e.g., age, sex). All the covariates added in this file will be used in the DEA to adjust the statistical model. c. The "contrast_file.txt" contains the conditions to compare during the analysis and a label for each contrast. Conditions included in this file must be the same as those stated in "contitions_file.txt". MyBrain-Seq can perform several contrasts in the same pipeline execution if several contrasts are specified in this file, one per line.

4.
The final step is running myBrain-Seq analysis using the "run.sh" script created during the working directory initialization (step number 1). This script will use "compi.parameters" as reference, mount all the needed Docker volumes (by extracting the path from the Compi parameters file) and create a directory for the log files of the current execution. MyBrain-Seq users do not need to modify this file, as it is ready to use. Thus, users only need to run the script using the path to "compi.parameters" as the unique argument to start the myBrain-Seq analysis.

5.
Both final and intermediate results are saved in the "/output" directory. Such output files are placed in directories corresponding to the different steps of the workflow, namely: "1_fastqc", "2_cutadapt", "3_bowtie", "4_bam_stats", "5_feature_counts", "6_deseq2", "6_deseq2+edger", "6_edger" and "7_multiqc". Results from the hierarchical clustering, functional analysis and network analysis are placed in the directories prefixed with "6_", according to the data from which they were generated. Files from the same contrast are grouped in subdirectories named with the contrast label.

MyBrain-Seq Results
The results of myBrain-Seq are placed in sub-directories inside the "output" directory (see Figure 2). The main results stem from the DEA and are presented for each contrast specified in "contrast_file.txt" and for each DEA method. Figure 3 illustrates the graphical results of a myBrain-Seq analysis:  • Results of the DEA; Figure 4A. Full table in supplementary Table S1. • List of DE miRNAs; Figure 4B. • Enriched pathways; Figure 4C. Full table in supplementary Table S2. • miRNA-protein interaction network; Figure 4D. Full table in supplementary Table S3. Finally, myBrain-Seq generates an HTML file with a summary of the results of the quality control, alignments, assignments and with the quantification of all the samples.    Results of the DEA; Figure 4A. Full table in supplementary Table S1.  List of DE miRNAs; Figure 4B.  Enriched pathways; Figure 4C. Full table in supplementary Table S2.  miRNA-protein interaction network; Figure 4D. Full table in supplementary Table  S3. Finally, myBrain-Seq generates an HTML file with a summary of the results of the quality control, alignments, assignments and with the quantification of all the samples.

Case Study
An early version of myBrain-Seq (v0.1.0) was used in Pérez-Rodríguez et al. 2023 to perform all analysis steps up to quantification [8]. In this study, myBrain-Seq was used to identify 16 differentially expressed miRNAs (DE miRNAs) between the MR and MNR conditions using DESeq2. The analysis performed can be described using Figure 1, where the reference factor "X" is the schizophrenia condition (MR), the response factor "Y" is the treatment-resistant schizophrenia condition (MNR) and six variables (n = 6) were used to adjust the differential expression model (V1, V2, . . . , V6). These six variables are: processing bath, sex, drug consumption (alcohol OR tobacco OR illegal), time (hospital arrival/discharge), treatment based on diazepines, oxazepines, thiazepines and oxepins and treatment based on other antipsychotics.
However, there are remarkable differences in the functional analysis and in the miRNAprotein network due to the application of different methodologies applied in both analyses. In Pérez-Rodríguez et al. 2023, we performed a bibliographic search followed by a target prediction to enrich Tarbase [30] annotations. After that, the network was built on Cytoscape [58], expanded and filtered using custom Cytostape [58] filters and StringApp [63]. The pathway enrichment was later performed using the molecules present in the resulting network. On the other hand, as explained before, myBrain-Seq v1.0.0 takes a simpler but effective approach to the functional analysis: to avoid the artificial miRNA target overrepresentation and construct a reduced miRNA-protein network. With this approach, the network is produced after the functional enrichment, thus being smaller and with no overrepresentation biases. This network can be further expanded and filtered externally as we did in Pérez-Rodríguez et al. 2023.
Regarding the enriched pathways, Table 2 offers a comparison between the top ten Reactome-enriched pathways [6] in Pérez-Rodríguez et al. 2023 and myBrain-Seq v1.0.0. In the original study, several pathway databases were used in order to perform the enrichment analysis. On the other hand, myBrain-Seq v1.0.0 only uses Reactome annotations. There are no coincidences between these top ten enriched pathways, probably because StringApp enrichment uses annotations of all levels of the Reactome pathway hierarchy whereas myBrain-Seq uses only the lowest level annotations. This has the advantage of providing more useful results by discarding big unspecific pathways such as "Metabolism of proteins", "Developmental Biology" or "Disease", which provide little help in getting to the molecular causes of a condition. Regarding this matter, 217 of the enriched pathways that were discovered using myBrain-Seq v.  Table S4). However, their significance in the context of the entire table differs significantly. For example, the second most enriched pathway in myBrain-Seq, "Activation of anterior HOX genes in hindbrain development during early embryogenesis" (see Table 2), which has a q-value of 1.22 × 10 −5 , is ranked at position 86 and has a q-value of 2.46 × 10 −7 in the results of Pérez-Rodríguez et al. 2023, which is two orders of magnitude lower (see Supplementary Table S4). In relation to this, it is also worth noting the differences in the scale of the p and q values. These differences are likely due to both the lack of specificity of the pathways and the overrepresentation of miRNA targets, resulting in imbalances in the enrichment values (see functional analysis section). Both phenomena produce a high number of false positives, which in turn forces the use of additional filtering strategies for the identification of relevant pathways. Thanks to myBrain-Seq correction, interpretations can be made straight from the enriched table.

Conclusions
MyBrain-Seq is a highly modular bioinformatics pipeline that specializes in replicable miRNA-Seq data analyses. Created using Compi, it provides a complete set of analyses ranging from raw data preprocessing and DEA to hierarchical clustering, functional analysis and network creation. MyBrain-Seq adaptations to miRNA data include the use of a shortsequence aligner, correction of the DEA model for confounding factors, correction of the artificial target overrepresentation in the functional analysis and creation of a miRNAprotein interaction network.
MyBrain-Seq has already been used in a real case study in which we compared schizophrenia patients who responded to medication to treatment-resistant schizophrenia patients to obtain a 16-miRNA treatment-resistant schizophrenia profile [8]. We were able to reproduce all the findings from the case study up to the functional analysis stage, including the miRNA profile suggested in Pérez-Rodríguez et al. 2023, as well as the hierarchical clustering. By making adjustments to myBrain-Seq functional analysis, we were able to generate more succinct and insightful enriched pathways, while also preventing any biases in the results that could have been caused by an overrepresentation of miRNA targets.
Overall, myBrain-Seq offers a powerful and reliable tool for miRNA-Seq data analysis, which can help researchers identify meaningful biological insights with greater confidence and ease. By reducing the need for additional analysis and providing a complete pipeline for data processing, DEA, hierarchical clustering, functional analysis and network creation, myBrain-Seq can streamline the research process and promote greater reproducibility of the results.
The tool is open for further extension and new features will be included as it is used in new studies. MyBrain-Seq is freely distributed under an MIT license and a complete manual is available at https://github.com/sing-group/my-brain-seq.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: MyBrain-Seq is freely distributed under an MIT license at https://github. com/sing-group/my-brain-seq. The dataset used for the case study is publicly available under Gene Expression Omnibus (GEO) accession number GSE223043.