With the availability and low cost of next-generation sequencing (NGS) technologies, genome-wide omics datasets are currently being generated at higher frequencies than ever before. These resources have rapidly expanded our knowledge of functional genomics in both animals and plants [1
]. However, the storage, analysis, management, and maintenance of the massive quantities of data produced by NGS remain quite challenging. In response to accumulating NGS-generated data, various bioinformatics databases have become popular, such as the Gene Expression Omnibus (GEO) from NCBI and ArrayExpress from EBI [2
]. In previous studies, a range of databases relating to Brassica genetics, genomics, and related activities have been proposed [4
]. The most widely used reference genome for B. napus is currently hosted by Genoscope (https://wwwdev.genoscope.cns.fr/brassicanapus/
]. However, because of structural variation, a single reference genome cannot cover the entire gene content of a species. Pangenome analysis was therefore proposed to ensure that genomic diversity within a species is fully represented. The Brassica napus pan-genome information resource (BnPIR, http://cbi.hzau.edu.cn/bnapus/) provides eight high-quality reference genomes representing different ecotypes, helping researchers better understand the genome structure and genetic basis of morphotype differentiation in rapeseed [6
]. In addition, other Brassica pangenome databases are also hosted in the Brassica
]. These databases provide high-quality reference genomes and pangenomes, which help us identify gene function more accurately and conveniently.
Among the multiple plant omics datasets, transcriptomic data provide important clues to help predict gene function or reveal hidden molecular mechanisms based on gene expression profiles [7
]. Large-scale transcriptome analyses in plants have also led to the development of databases such as PlantExpress [8
]. At the transcriptome level, RNA sequencing (RNA-Seq) has emerged as an important approach for comprehensive gene expression analysis. RNA-Seq has several advantages over other techniques [9
]. RNA-Seq can be used to comprehensively measure the expression levels of all transcripts in a plant tissue without the need to design probes. Since its emergence, RNA-Seq data have been continuously deposited in public databases, such as the NCBI Sequence Read Archive (SRA) database [10
]. Other important databases housing large-scale RNA-Seq data from the animal and plant fields include Silk DB, Melonet-DB, and ePlant [11
]. However, a gene expression database for rapeseed based on RNA-Seq is still lacking.
Rapeseed (Brassica napus, 2n = 38, AACC) is an allotetraploid species that was formed as a result of spontaneous interspecific hybridization between Brassica oleracea (2n = 18) and Brassica rapa (2n = 20) [14
]. Since genomic information for rapeseed first became publicly available [5
], numerous transcriptome studies have been conducted to enhance our understanding of gene function in this important crop [15
]. However, these gene expression datasets remain to be further integrated and explored.
Here, we constructed the online Brassica Expression Database (BrassicaEDB, https://biodb.swu.edu.cn/brassica/
). We generated a large-scale gene expression profile of rapeseed based on RNA-Seq data obtained from 103 tissues from rapeseed cultivar ZS11 during seven developmental stages (germination, seedling, bolting, initial flowering, full-blooming, podding, and maturation) (Figure 1
). We chose this elite cultivar for its ultra-high oil content, high lodging resistance, high disease resistance, and low erucic acid and glucosinolate levels, as well as its broad eco-physiological adaptation to different climatic conditions worldwide [18
]. Since the Arabidopsis eFP browser was introduced as the first web tool enabling in silico gene expression analysis [19
], a number of transcriptome datasets have been integrated into the eFP browser, covering different plant tissues and organs under normal conditions and in response to abiotic or biotic stress. We therefore implemented the eFP browser on our website, allowing users to comprehensively view gene expression levels across tissues at various stages of development. To ensure that transcriptome data stored in the SRA database could be further mined and integrated, we also analyzed transcriptome data for 837 rapeseed-related samples from 70 BioProjects in the SRA database (Figure 1
). Three types of gene expression values (FPKM, TPM, and read counts) are provided in the BrassicaEDB. Finally, we developed the “eFP”, “Treatment”, “Coexpression”, and “SRA Project” modules based on gene expression profiles and the “Gene Feature”, “qPCR Primer”, and “BLAST” modules based on gene sequences, providing powerful tools for comprehensive gene expression analysis in rapeseed [20
At present, most rapeseed genome and transcriptome data are stored across numerous databases [4
]. However, a dedicated database integrating gene expression data for Brassica species, especially for rapeseed, is still lacking. Although BnPIR provides an online web interface to query and visualize gene expression levels, only 40 tissues during flowering stages were included [6
]. In this study, we aimed to build a comprehensive gene expression database for rapeseed. We first collected 103 tissues, covering almost the entire rapeseed life cycle, for RNA-Seq, and then downloaded 837 public RNA-Seq samples from the SRA database. Overall, the BrassicaEDB provides rapeseed researchers with comprehensive gene expression profiles and a visual interface, filling a gap in the tools available for exploring gene expression in rapeseed and laying the foundation for a preliminary understanding of gene function in rapeseed at the transcriptome level.
In the future, we plan to improve several aspects of the BrassicaEDB. First, we plan to add data from additional species to the BrassicaEDB. The genus Brassica
includes several economically important plants. Six species (Brassica carinata, Brassica juncea, Brassica napus, Brassica oleracea, Brassica nigra, and Brassica rapa) are related as described by the “triangle of U” theory, in which three allotetraploid species arose by combining the genomes of three diploid species [25
]. Thus, in addition to rapeseed, we will include the gene expression profiles of these five other species. Second, we will expand the expression data and RNA types. The transcriptome data obtained from the public databases can be used for analysis at the gene expression level. Using “Brassica” as the query, 8844 transcriptome samples could be retrieved from the SRA database as of 1 July 2020. Clearly, the bulk of transcriptome data remains to be analyzed and further explored. The availability of big data could provide more comprehensive, accurate information about gene function in the future [26
]. Finally, we plan to expand the analytical tools available in the BrassicaEDB. To help researchers complete their work more efficiently and conveniently, more tools based on gene expression profiles will be provided in the next version of BrassicaEDB.
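The “triangle of U” relationship mentioned above can be illustrated with simple chromosome arithmetic. The following Python sketch is an illustration only and is not part of the BrassicaEDB: the genome letters and diploid chromosome numbers are the standard values for these species (A, 2n = 20; B, 2n = 16; C, 2n = 18), while the dictionary and function names are hypothetical.

```python
# Diploid progenitor genomes: genome letter -> (species, 2n chromosome number).
DIPLOIDS = {
    "A": ("Brassica rapa", 20),
    "B": ("Brassica nigra", 16),
    "C": ("Brassica oleracea", 18),
}

# Allotetraploid species and the two progenitor genomes each one combines.
ALLOTETRAPLOIDS = {
    "Brassica juncea": "AB",
    "Brassica napus": "AC",
    "Brassica carinata": "BC",
}


def chromosome_number(genomes):
    """Sum the 2n chromosome numbers of the progenitor genomes."""
    return sum(DIPLOIDS[g][1] for g in genomes)


for species, genomes in ALLOTETRAPLOIDS.items():
    parents = " x ".join(DIPLOIDS[g][0] for g in genomes)
    print(f"{species}: 2n = {chromosome_number(genomes)} ({parents})")
```

For rapeseed, the A and C genomes give 2n = 20 + 18 = 38, matching the AACC constitution of Brassica napus.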
4. Materials and Methods
4.1. Plant Materials and Growth Conditions
The elite rapeseed cultivar ZS11, which is widely cultivated in China, was selected for developmental transcriptome sequencing. Seeds were germinated in plant growth chambers (PGC Flex; Conviron, Winnipeg, MB, Canada) with a 16 h photoperiod at 25/18 °C day/night, 60% humidity, and a light intensity of 250 µmol/m²/s. After germination, the plants were transplanted to an experimental field in Beibei, Chongqing (CQ; 29°45′ N, 106°22′ E, 238.57 m) and grown under natural conditions. Each plot contained three rows of 10 plants, with 20 cm between plants within each row and 30 cm between rows. During the plant life cycle, 103 different tissues were collected. These tissues included seedling roots (sRo), hypocotyls (Hy), cotyledons (Co) (24, 48, and 72 h after germination; HAG), and germinating seeds (GS) (12 and 24 HAG); roots (Ro) and mature leaves (ML) at the seedling stage; Ro, stems (St), young leaves (YL), ML, buds (Bu), and inflorescence tips (IT) at the bolting stage; Ro, St, YL, ML, pedicels (Ped), IT, sepals (Sep), petals (Pe), carpels (Ca), stamens (Sta), anthers (An), and filaments (Fi) at the initial bloom and full-bloom stages; ML and YL at 10, 24, and 30 days after flowering (DAF); seeds (Se) and silique pericarps (SP) at 15 and 12 regular intervals, respectively, between 3 and 46 DAF; embryos (Em) and seed coats (SC) at ten stages of seed development (19 to 49 DAF); inner integuments (InI) at 21 and 24 DAF; and outer integuments (OuI) at 24 and 30 DAF. For each sample, two biological replicates, each obtained from three independent plants, were collected and frozen in liquid nitrogen for RNA-Seq.
4.2. RNA Isolation and Transcriptome Sequencing
Total RNA was extracted from all tissues using an RNAprep Pure Kit for Plant (Tiangen Biotech, Beijing, China) according to the manufacturer’s instructions and stored at −80 °C until use. A total of 206 libraries (103 tissues × two biological replicates) were constructed using a TruSeq RNA Library Prep Kit v2 following standard operating procedures (Illumina, https://www.illumina.com/
). All samples were multiplexed per lane of a flow cell, and 125 bp paired-end reads were generated using an Illumina HiSeq 2500 sequencer (Illumina Inc., San Diego, CA, USA).
4.3. Public Data Sources
We downloaded the raw sequencing data for 837 samples in 70 BioProjects from the SRA database using fastq-dump from the SRA Toolkit (https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.7/
), including 169 abiotic stress samples, 211 biotic stress samples, 110 chemical stress samples, 200 developmental samples, and 147 genetic samples, for a total of 2.4 TB of data.
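As a quick consistency check, the five sample categories listed above should add up to the 837 samples that were downloaded. The Python sketch below tallies them and also shows how a fastq-dump command line for a single run might be assembled. The options shown (--split-files, --gzip, --outdir) are real fastq-dump options, but whether the authors used them is an assumption, and the accession "SRR0000000" and the function name are hypothetical placeholders. The command is only built, not executed.

```python
# Sample counts per category, as reported in the text.
category_counts = {
    "abiotic stress": 169,
    "biotic stress": 211,
    "chemical stress": 110,
    "development": 200,
    "genetic": 147,
}
total_samples = sum(category_counts.values())


def fastq_dump_command(run_accession, outdir="fastq"):
    """Build (but do not run) a fastq-dump command for one SRA run.

    The chosen options are assumptions for illustration:
    --split-files writes paired-end mates to separate files,
    --gzip compresses the output, --outdir sets the target directory.
    """
    return [
        "fastq-dump",
        "--split-files",
        "--gzip",
        "--outdir", outdir,
        run_accession,
    ]


# "SRR0000000" is a hypothetical placeholder, not a real accession.
example_command = fastq_dump_command("SRR0000000")

print(f"Total public samples: {total_samples}")
print(" ".join(example_command))
```

In practice, such a command list could be passed to subprocess.run for each run accession belonging to the 70 BioProjects.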
4.4. RNA-Seq Data Analysis
The quality of the RNA sequencing reads was examined using FastQC (v0.11.3) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
). Barcode adaptors were clipped from the RNA-Seq reads and low-quality reads were removed (read quality < 80 for paired-end reads, read quality < 20 for single-end reads) using Trimmomatic (v0.38) [27
]. RNA sequence reads passing the quality filter were aligned to the rapeseed reference genome v.4.1 [4
], with rapeseed reference annotation v.5.0 used as a guide, using STAR to produce BAM files [4
]. The expression levels of 101,040 genes were quantified in each public sample, generating FPKM, TPM, and read count values. Cufflinks was used to generate normalized counts in FPKM [29
]. Reads or fragments were counted from BAM files using featureCounts, and exons were defined as features at the gene level [30
]. The TPM value for each sample was obtained using salmon (v1.0.0) with the parameter “validateMappings” to ensure that all genes would be preserved [31
], without using the “decoys” parameter when building the mapping-based index.
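To make the relationship between the three expression values concrete, the following Python sketch computes FPKM and TPM from raw read counts using their textbook definitions. It is a simplified illustration under stated assumptions, not a reimplementation of the pipeline: Cufflinks and salmon apply their own statistical models and effective-length corrections, and the gene counts, gene lengths, and function names used here are hypothetical.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM_i = count_i * 1e9 / (length_i * total_fragments)
    """
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million.

    TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6
    """
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    denom = sum(rates)
    return [r / denom * 1e6 for r in rates]


# Hypothetical toy data: three genes with raw fragment counts and lengths.
counts = [100, 300, 600]
lengths_bp = [1000, 1500, 3000]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

for i, (f, t) in enumerate(zip(fpkm_values, tpm_values), start=1):
    print(f"gene{i}: FPKM = {f:.1f}, TPM = {t:.1f}")
```

By construction, TPM values sum to one million within each sample, which makes proportions easier to compare between samples than FPKM values, whose sum varies from sample to sample.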
4.5. System Architecture and Software for Database Construction
The BrassicaEDB was built with PostgreSQL 9.6 (http://www.postgresql.org
), PHP 7.1 (http://www.php.net
), Apache 2.4 (http://www.apache.org
) and Perl 5.16 (https://www.perl.org
), and all procedures ran on the CentOS 7 Linux (https://www.centos.org
) operating system. The Chado database schema was implemented for storing and managing genomic and transcriptomic data [22
), which supplies many packages and plugins for rendering data and making user interfaces. The BLAST tool was built with NCBI BLAST+ 2.10.1, and it supports searching against selectable and multiple databases [20