Identification of Genes that Control Silk Yield by RNA Sequencing Analysis of Silkworm (Bombyx mori) Strains of Variable Silk Yield

Silk is an important natural fiber of high economic value, and thus genetic study of the silkworm is a major area of research. Transcriptome analysis can provide guidance for genetic studies of silk yield traits. In this study, we performed a transcriptome comparison using multiple silkworms with different silk yields. A total of 22 common differentially expressed genes (DEGs) were identified in multiple strains and were mainly involved in metabolic pathways. Among these, seven significant common DEGs were verified by quantitative reverse transcription polymerase chain reaction, and the results coincided with the findings generated by RNA sequencing. Association analysis showed that BGIBMGA003330 and BGIBMGA005780 are significantly associated with cocoon shell weight and encode uridine nucleosidase and small heat shock protein, respectively. Functional annotation of these genes suggest that these play a role in silkworm silk gland development or silk protein synthesis. In addition, we performed principal component analysis (PCA) in combination with wild silkworm analysis, which indicates that modern breeding has a stronger selection effect on silk yield traits than domestication, and imply that silkworm breeding induces aggregation of genes related to silk yield.


Introduction
Domestic silkworm (Bombyx mori) is an economically important insect that was domesticated more than 5000 years ago. In China, the total direct income of silkworm farmers was 111.54 billion yuan in the past five years, the output value of enterprises above a designated size was 641.3 billion yuan, and the export value of real silk products was 16.6 billion US dollars [1]. In addition, silk fibroin has recently been used as a biomaterial and biomedicine for the development and utilization of bone tissue engineering, silk fibroin hydrogels, implant coatings, and nanomaterials [2][3][4][5][6]. The huge industrial value and broad application prospects of silk has thus prompted researchers to develop techniques to improve its yield. Today, traditional breeding methods have been unable to further improve silk quantity. It is thus imperative to use advanced molecular biology methods to identify genes that control silk yield.
Silk traits are quantitative traits that are regulated by multiple genes, and their genetic mechanisms are complex. Currently, quantitative trait loci (QTLs) for multiple cocoon quality traits have been mapped, laying the foundation for gene detection [7][8][9][10][11][12]. Our group previously developed a suitable

Phenotypic Assessment of Cocoons of Various Silkworm Strains
This study used three high-yield strains (872B, Qiufeng, and Xiafang) and three low-yield strains (19-200, 10-710, and 19-460) with significantly different cocoon quantity traits (Figure 1). The three high-yield strains adopted in this study are excellent and practical strains, which are widely used in breeding [19][20][21], and the average CSW is about three times of the selected low-yield strains (Table 1). We investigated the traits of whole cocoon weight, CSW, and pupa weight of 12 silkworm strains, including sequenced strains of males and females (Table 1). Significant differences in cocoon traits were observed among strains. CSW was about four-fold higher in the high-yield strains than in the low-yield strains. In addition, the high-yield strains had two-to three-fold higher cocoon shell rates than the low-yield strains. Silk glands showed different development rates among strains and developmental stages. The developmental rate of the silk glands of strain 19-200 peaked on the day before mounting (Figure 2). We collected samples at this time point to ensure that the majority of genes associated with silk traits were upregulated. Furthermore, different silkworm silk glands have different lengths of development. The accurate selection of the dissection time-point is thus critical to ensure that the silk glands of each strain are at the same stage of development. Therefore, to ensure accuracy of the sequencing data, the fifth instar (5.5 days) of strain  was used as the standard, and its length was then used in calculating the dissection time-point (Table 2) of each strain.

Overview of Transcriptome Sequencing Data
For each strain, we randomly selected three larvae, from which the entire silk gland was isolated and RNA was extracted. RNA sequencing was performed using the Illumina HiSeq TM 4000 system [22]. After filtering the data for adaptor sequences, unknown sequences (N), and low-quality reads, 45.9 gigabases (Gb), 46.8 Gb, and 45.1 Gb, 46.5 Gb, 44.7G, and 46.9G of clean reads were obtained for strains, 10-710, 19-200, 19-460, 872B, Xia Fang, and Qiu Bai, respectively. We compared the clean reads with the silkworm reference genome, and the percentage of total reads in the six transcriptome databases ranged from 52.1% to 72.3% (Table 3). The average number of clean reads was about 46 million. Among all strains, 29.6%-40.84% of the clean reads were aligned to about 9,000 genes, of which strain 872B had the highest number of genes, in which 9,397 reads could be aligned to the silkworm reference genome (Table S1). The number of new transcripts for each strain ranged from 842 to 981 (Table S2).

Overview of Transcriptome Sequencing Data
For each strain, we randomly selected three larvae, from which the entire silk gland was isolated and RNA was extracted. RNA sequencing was performed using the Illumina HiSeq TM 4000 system [22].
After filtering the data for adaptor sequences, unknown sequences (N), and low-quality reads, 45.9 gigabases (Gb), 46.8 Gb, and 45.1 Gb, 46.5 Gb, 44.7G, and 46.9G of clean reads were obtained for strains, 10-710, , 872B, Xia Fang, and Qiu Bai, respectively. We compared the clean reads with the silkworm reference genome, and the percentage of total reads in the six transcriptome databases ranged from 52.1% to 72.3% (Table 3). The average number of clean reads was about 46 million. Among all strains, 29.6%-40.84% of the clean reads were aligned to about 9,000 genes, of which strain 872B had the highest number of genes, in which 9,397 reads could be aligned to the silkworm reference genome (Table S1). The number of new transcripts for each strain ranged from 842 to 981 (Table S2).

Differential Expressed Gene Function Annotation and Functional Analysis
The expression of all genes was estimated using the RPKM (reads per kb per million reads) method [23]. The following comparison strategy for screening DEGs was adopted in this study ( Figure 3). By comparing the three high-yield strains with the three low-yield strains, nine groups of DEGs were obtained. The edgeR function in Bioconductor was employed for differential expression analysis [24]. All the DEGs are presented in Table 4. The number of DEGs in each group varied from 8 to 261 (Table S3). The gene sequences were aligned to the NR (Non-Redundant Protein Sequence Database) library using BLAST to extract DEG annotation information. The topGO function in the software, Bioconductor, was employed for DEG enrichment analysis [25].

Differential Expressed Gene Function Annotation and Functional Analysis
The expression of all genes was estimated using the RPKM (reads per kb per million reads) method [23]. The following comparison strategy for screening DEGs was adopted in this study ( Figure 3). By comparing the three high-yield strains with the three low-yield strains, nine groups of DEGs were obtained. The edgeR function in Bioconductor was employed for differential expression analysis [24]. All the DEGs are presented in Table 4. The number of DEGs in each group varied from 8 to 261 (Table S3). The gene sequences were aligned to the NR (Non-Redundant Protein Sequence Database) library using BLAST to extract DEG annotation information. The topGO function in the software, Bioconductor, was employed for DEG enrichment analysis [25]. . Experimental comparison strategy design. By comparing the three high-yield strains with the three low-yield strains, nine groups of DEGs were obtained. Differentially expressed in more than four of the nine groups of DEGs were defined as common DEGs. The key genes were obtained by screening the common DEGs through quantitative detection of each strain in silk glands. Finally, the association analysis with the cocoon shell weight (CSW) was conducted in key genes.
For functional annotation of the DEGs, GO enrichment analysis was performed. GO terms with a corrected p < 0.05 were considered significantly enriched with DEGs. GO functional enrichment analysis of DEGs of the six strains revealed significant enrichment in the categories of biological process and molecular function. Further assessment revealed that the annotation information of nine DEGs were similar. In the functional category of biological process, significant DEG enrichment . Experimental comparison strategy design. By comparing the three high-yield strains with the three low-yield strains, nine groups of DEGs were obtained. Differentially expressed in more than four of the nine groups of DEGs were defined as common DEGs. The key genes were obtained by screening the common DEGs through quantitative detection of each strain in silk glands. Finally, the association analysis with the cocoon shell weight (CSW) was conducted in key genes.
For functional annotation of the DEGs, GO enrichment analysis was performed. GO terms with a corrected p < 0.05 were considered significantly enriched with DEGs. GO functional enrichment analysis of DEGs of the six strains revealed significant enrichment in the categories of biological process and molecular function. Further assessment revealed that the annotation information of nine DEGs were similar. In the functional category of biological process, significant DEG enrichment between the high-and low-yield strains was observed for the subcategories of cellular process, metabolic process, and single-organism process. For the functional category of cellular components, DEG enrichment was also detected in the subcategories of extracellular and extracellular regions. In terms of molecular function, DEG enrichment was observed in the subcategories of binding and catalytic activity ( Figure S1). BLAST analysis of the DEGs to the KEGG database was performed to obtain annotation information relating to pathways [26]. Based on the resulting list of DEGs, a hypergeometric test was used to perform pathway enrichment analysis of DEGs, which showed that the endoplasmic reticulum pathway in protein processing of metabolic pathways was enriched with DEGs between the high-and low-yield silkworms ( Figure S2). Figure 4 shows the DEGs identified from the comparison of multiple groups. Twenty genes that were differentially expressed in more than four of the nine groups of DEGs were defined as common DEGs (Table 4), which included nine genes that were upregulated in low-yield strains, and 11 genes that were upregulated in high-yield strains. Five of the 20 common DEGs were detected in at least five groups of BLAST data.    We validated five significant common DEGs in the MGs and PGs of the strains that were used for sequencing by qPCR ( Figure 5), and the results coincided with the RNA-seq data. BGIBMGA004399, BGIBMGA009092, and BGIBMGA009093 genes in the MGs, and BGIBMGA005780 and BGIBMGA003330 genes in the PGs were differentially expressed between the high-and low-yield strains. We further analyzed the association between the five significant common DEGs and silk yield traits in the 12 strains with different silk yields by qPCR ( Figure S3 and Figure 6), which indicated that the expression of BGIBMGA003330 and BGIBMGA005780 in the silk gland is closely related to the silk fibroin generated ( Figure 6). and BGIBMGA003330 genes in the PGs were differentially expressed between the high-and low-yield strains. We further analyzed the association between the five significant common DEGs and silk yield traits in the 12 strains with different silk yields by qPCR ( Figures S3 and 6), which indicated that the expression of BGIBMGA003330 and BGIBMGA005780 in the silk gland is closely related to the silk fibroin generated (Figure 6).

PCA of Wild Silkworm, Local Strains, and Practical Strains
Cocoon quality traits undergo two stages of selection, namely, the domestication process from wild silkworm strains to local strains, and the breeding process from local species to practical species. PCA was employed to detect differences in gene expression patterns among various strains by analyzing similarities in major strain components. Thus, differences in selection for silkworm cocoon

PCA of Wild Silkworm, Local Strains, and Practical Strains
Cocoon quality traits undergo two stages of selection, namely, the domestication process from wild silkworm strains to local strains, and the breeding process from local species to practical species. PCA was employed to detect differences in gene expression patterns among various strains by analyzing similarities in major strain components. Thus, differences in selection for silkworm cocoon quality traits in these two processes can be inferred. In this study, the data of three local strains and three practical strains were combined with two wild silkworm data from other studies [16] to conduct PCA. The results showed that the main components of the two wild silkworm strains and the three local strains were distributed, whereas those of the practical strains were clustered together (Figure 7). Compared to the domestication process, modern breeding imparts stronger selection pressure on genes that are related to silk yield, so that silkworms of different high-yield strains exhibit similar expression patterns. The common genes that are highly expressed in the three high-yield silkworm strains are likely to be important genes that control silk yield. Based on these results, we identified genes that are highly expressed in all three high-yield strains, followed by functional annotation ( Table 4). Most of these genes are ribosomal proteins and structural proteins, which presumably play an important role in the biosynthesis of silk protein.
genes that are related to silk yield, so that silkworms of different high-yield strains exhibit similar expression patterns. The common genes that are highly expressed in the three high-yield silkworm strains are likely to be important genes that control silk yield. Based on these results, we identified genes that are highly expressed in all three high-yield strains, followed by functional annotation ( Table 4). Most of these genes are ribosomal proteins and structural proteins, which presumably play an important role in the biosynthesis of silk protein.

Discussion
In this study, multiple silkworm strains with different cocoon colors were used as research materials. By screening the common DEGs, the background interference caused by the particularity of a single strain, such as the genes controlling cocoon color, could be better avoided, making the results more accurate and reliable. RNA-seq analysis of various silkworm strains identified up to 261 DEGs in each pair of samples, and 22 common DEGs were screened. These common DEGs mostly function in energy metabolism and protein synthesis. The results suggest that the energy metabolic rate is related to silk gland development. In addition, protein synthesis affects silk protein synthesis efficiency, thus affecting silkworms' silk production.
Common DEGs were identified by association analysis, and gene BGIBMGA003330 and BGIBMGA005780 were significantly associated with CSW. From the results of quantitative verification, we can see that the expression levels of genes BGIBMGA003330 and BGIBMGA005780 presented significant differences between the high-yield and low-yield strains in the PGs. The PGs

Discussion
In this study, multiple silkworm strains with different cocoon colors were used as research materials. By screening the common DEGs, the background interference caused by the particularity of a single strain, such as the genes controlling cocoon color, could be better avoided, making the results more accurate and reliable. RNA-seq analysis of various silkworm strains identified up to 261 DEGs in each pair of samples, and 22 common DEGs were screened. These common DEGs mostly function in energy metabolism and protein synthesis. The results suggest that the energy metabolic rate is related to silk gland development. In addition, protein synthesis affects silk protein synthesis efficiency, thus affecting silkworms' silk production.
Common DEGs were identified by association analysis, and gene BGIBMGA003330 and BGIBMGA005780 were significantly associated with CSW. From the results of quantitative verification, we can see that the expression levels of genes BGIBMGA003330 and BGIBMGA005780 presented significant differences between the high-yield and low-yield strains in the PGs. The PGs are used to produce silk fibroin. Silk sericin, wrapped in the outer layer of silk fibroin to protect and bind it, is water-soluble and soluble in water, acid, and alkali solutions. However, silk fibroin only swells in water, and is insoluble in water. In the process of production and utilization, silk fibroin is separated by sericin melting during the process of filature, which is the main material for silk products. It can be seen that the increase of silk fibroin production is the core of the increase of silk yield. We speculate that BGIBMGA003330 and BGIBMGA005780 may be related to silk fibroin synthesis. Genes BGIBMGA003330 and BGIBMGA005780 encode uridine nucleosidase and small heat shock protein (sHSP), respectively. Uridine nucleosidase plays a fundamental role in the interconversion of pyrimidine nucleotides and in the reutilization of pyrimidine nucleosides. Uridine nucleosidase is used to decompose uridine into uracil and ribose [27], then uracil is incorporated into the mononucleotide, UMP (Uridine monophosphate) [28]. UMP is the composition of nucleic acid involved in the basic life activities of organisms, such as heredity, development, and growth [29]. Ribose, of fundamental importance in the pyrimidine nucleotides' biosynthesis from exogenously supplied or endogenously formed bases and nucleosides, is involved in the biosynthesis of storage compounds and the synthesis of structural building compounds [30]. The role of this gene in development and protein synthesis may be related to the development of silk glands or silk protein synthesis, thereby affecting the silk yield.
Small heat shock protein is a molecular chaperone with strong anti-aggregation properties. Previous studies have shown that sHSP enhances heat tolerance [31], protects cells from stress [32], prevents protein misfolding [33], restores the natural concept of unfolded or partially folded peptides [34], alters mitochondrial metabolism [35], controls mitochondria protein quality to extend the lifespan [36], influences the rate of embryonic development, and prevents spontaneous diapause [37]. sHSP are not only expressed independently in the heat shock response, but also play an important role in some animal development processes, which may be related to silk gland development, such as regulating cell movement [38]; regulating intermediate filament assembly [39]; inhibiting apoptotic signaling [40]; binding to certain kinases to activate and protect them from heat inactivation [41]; and allowing cancerous cells to escape the immunosurveillance mediated by death ligands [42]. There is no evidence related to silk gland development in the current report, but in our study, it was found that this gene is significantly associated with the silkworm CSW. Moreover, there is also evidence that sHSP maintains protein structure and enhance the growth of tumors in vivo [42]. So, it is also possible that these may also have other functions that are related silk gland development that have yet to be discovered.
Besides, the common DEGs, BGIBMGA004399 and BGIBMGA008165, encode a 30K protein that is involved in embryonic development [43], energy storage [44], and the immune response [45] in the silkworm, plays an important role in energy metabolism, and contributes to inhibiting apoptosis [46]. We postulate that these genes affect the rate of silk protein synthesis by regulating energy metabolism, thereby affecting silk yield traits.
In this study, PCA results attracted our attention. PCA indicated that low-yield strains significantly vary in terms of gene expression patterns, whereas high-yield strains have highly similar expression patterns ( Figure 6). Previous studies have shown that the linkage groups that control the QTLs for different parents overlap [47][48][49][50], suggesting that unlike the long domestication process, modern genetic breeding has a strong selective effect on silk yield traits. The rapid accumulation of superior genes for cocoon traits or the generation of new genes that increase yield can lead to higher breeding values in the short term. Genes that are highly expressed in the three high-yield strains are likely to be important genes that control silk yield. We have identified that genes that are highly expressed in the three high-yield strains are relatively expressed at low levels in the three low-yield strains, and were functionally annotated. Most of these genes are ribosomal proteins and structural proteins, which may play an important role in the biosynthesis of silk protein.
In summary, we have described an approach to circumvent the effect of background differences using multi-strand RNA-seq alignments to more accurately screen for genes related to silk yield. The differentially expressed genes identified in this study may serve as the object of the functional study, and it is necessary to deeply study its function and the mechanism of affecting silk yield.

Silkworm Breeding and Sample Preparation
The selected materials were six silkworm strains for RNA-seq, including three high-yield strains (Qiufeng, Xiafang, and 872B), and three low-yield strains (10-710, 19-200, and 19-460), and 12 strains (Table 1) with different cocoon traits and silk yields for correction analysis. The silkworms were obtained from the silkworm resource bank of Southwest University, Chongqing, China. The larvae were fed fresh mulberry leaves and maintained at conditions of sTable 14 h light and 10 h dark photoperiod at 25 ± 1 • C and 75% ± 3% RH. Intact silk glands were dissected and frozen immediately in liquid nitrogen and then stored at −80 • C for sequencing. In this study, three samples were prepared as biological replicates.
The silkworm cocoons of different strains were collected for phenotypic assessment, and these silkworms were reared at the same time and in the same conditions. The cocoons of the low-yield silkworms (19-460, 10-710, and 19-200) were much smaller than the high-yield strains (872B, Qiufeng, and Xiafang). The statistical results of CSW show significant differences between the low-and high-yield strains.

RNA Library Construction and Sequencing
Total RNA was extracted using TRIzol reagent according to the manufacturer's recommendations (Invitrogen, Burlington, ON, Canada), and DNA was digested with DNase I. Eukaryotic mRNAs are enriched with oligo(dT) magnetic beads. Using a thermomixer, the mRNAs were sheared into short fragments under the appropriate temperature, and a strand of cDNAs were synthesized using the interrupted mRNAs as template. Then, a two-stranded synthesis reaction system was used to synthesize the two-stranded cDNA, and kits were used for purification and recovery, cohesive end-repair, adding the base "A" to the 3 end of the cDNAs, and attaching the adapter, followed by fragment size selection and PCR amplification. The constructed library was qualified with the Agilent 2100 Bioanalyzer and the ABI StepOne Plus Real-Time PCR system and then sequenced using the Illumina HiSeq TM 4000. The data from the Illumina HiSeq TM 4000 sequence were designated as raw reads or raw data [22], which were then subjected to quality control (QC) to determine that the sequencing data is suitable for subsequent analysis. After QC, the raw reads were filtered to obtain clean reads, and a SOAP aligner/SOAP2 [51] was used to compare the clean reads to the reference sequence. A BLAST search was then performed to determine the distribution and coverage of reads on the reference sequence to assess whether the comparison results passed the second QC of alignment, thereby ensuring that subsequent analysis is based on high-quality clean reads.
Reference sequence and gene model annotations were downloaded from the Silkworm Genome Database (SilkDB; http://www.silkdb.org/silkdb/) [52]. We compared the clean data of each sample with the reference gene set of the silkworm using the short reads software, SOAPaligner/SOAP2 [51], and compared the clean data of each sample with the reference genome using TopHat (http://ccb.jhu. edu/software/tophat/index.shtml) [53]. The results were then used in the analyses for alternative splicing and prediction of new transcripts.

Quantitative and Differential Analysis of Transcripts
We used the reads per kilobase per million reads (RPKM) [23] method to estimate gene expression levels. The differential expression of two samples was screened using the edgeR [24] function in Bioconductor. The edgeR function assumes that the count of sequencing reads is a negative binomial distribution for each gene. Hypothesis testing was performed based on this theoretical distribution. The differentially expressed gene screening conditions were set to FDR ≤ 0.05 |log 2 ratio| ≥ 1 [54]. Genes with similar expression patterns usually have similar functions. To analyze gene expression patterns, we used the R software and used the Euclidean distance as the distance calculation formula to perform hierarchical clustering of DEGs and experimental conditions.

Differential Gene Function Annotation and Functional Analysis
We obtained the reference sequence for all silkworm genes from the NCBI database and used BLASTX for protein annotation. Then, we employed the BLAST software to align the gene sequences to the NR library and extracted the GO annotation information for all genes from the Gene Ontology database (http:/www.geneontology.org/). Based on the list of DEGs, we performed GO enrichment analysis of DEGs using the topGO function in the software, Bioconductor [25]. The KEGG analysis [26] involves a BLAST search of genes in the KEGG database to annotate pathway information on the DEGs.
Then, based on the list of DEGs obtained, the pathway enrichment analysis of DEGs was performed using hypergeometric testing to find the pathway that was significantly enriched with DEGs compared to the entire genomic background.

Expression Quantity Identification of DEGs
To verify RNA-seq data, we used real-time quantitative PCR (qPCR) to identify DEGs for expression quantity identification. Intact silk glands were dissected according to the timetable for sample preparation. In addition, other tissues of the fifth instar 3-day larvae of 19-200 and 872B, including the head, epidermis, midgut, Malpighian tubule, fat body, hemolymph, middle silk gland, posterior silk gland, trachea plexus, testis, nerves, and ovaries, were dissected for tissue expression identification of DEGs. Total RNA was extracted using the RNApure ultrapure total RNA Rapid Extraction Kit (BioTek) strictly following the manufacturer's instructions. Ultraviolet spectrophotometry was used to determine RNA integrity and purity. The cDNA was extracted using the PrimeScript RT reagent kit with a gDNA Eraser kit, and cDNA purity was assessed by PCR using primers specific to Actin3 (A3), a silkworm housekeeping gene. The CDS sequences of the confirmed genes were extracted from the Silkworm Genome Database (http://www.silkdb.org/silkdb/) [42] and used in designing primers to span at least one intron. qPCR was performed using the CFX96TM Real-Time PCR Detection System (Bio-Rad, Hercules, CA, USA) and SYBR Green qPCR Mix (Bio-Rad) reagents. qPCR was conducted in a reaction volume of 10 µL containing 1 µL of the template, 5 µL of 2 × SYBR Green II, 5 µL of ddH 2 O, and 0.3 µL of the specific primers. The PCR conditions were as follows: Pre-denaturation at 95 • C for 30 s; followed by 40 cycles of denaturation at 95 • C for 5 s and annealing at 60 • C for 30 s; and a melting curve at 65 • C for 5 s, with a stepwise temperature increase of 0.5 • C until 95 • C. The volume of the reaction system was 10 µL, and three technical replicates were used per sample. The expression levels of each gene of the two strains were compared based on a t-test. Differences in gene expression between high-and low-yield strains were considered significantly different at p < 0.05.

Conflicts of Interest:
The authors declare no conflict of interest.

DEGs
Differentially expressed genes PCA Principal component analysis QTL Quantitative trait loci CSW Cocoon shell weight