Co-Expression Network and Time-Course Expression Analyses to Identify Silk Protein Regulatory Factors in Bombyx mori

Simple Summary Previous studies have reported how the silk production ability of Bombyx mori can be enhanced, but the mechanism that regulates silk protein genes remains unclear. We performed co-expression network analysis using networkz, an in-house program, which led to the identification of 91 transcription factors were co-expressed with silk protein genes. Of them, 13 transcripts were identified to be novel regulatory factors by time-course expression analysis during the fifth instar larvae stage. Their expression patterns were highly relevant to those of silk protein genes. Our results suggest that the two-step expression screening was robust and highly sensitive to screen relative genes, and a complex mechanism regulates silk protein production in B. mori. The novel candidates that were identified herein can serve as key genes to develop methods to enhance the silk protein production ability of B. mori. Abstract Bombyx mori is an important economic insect and an animal model in pharmacomedical research. Although its physiology has been studied for many years, the mechanism via which silk protein genes are regulated remains unclear. In this study, we performed two-step expression screening, namely co-expression network and time-course expression analyses to screen silk protein regulation factors. A co-expression network analysis using RNA-seq data that were obtained from various tissues, including the silk glands of B. mori, was performed to identify novel silk protein regulatory factors. Overall, 91 transcription factors, including some known ones, were found to be co-expressed with silk protein genes. Furthermore, time-course expression analysis during the fifth instar larvae stage revealed that the expression pattern of 13 novel transcription factors was highly relevant to that of silk protein genes and their known regulatory factor genes. In particular, the expression peak of several transcription factors (TFs) was detected before the expression of silk protein genes peak. These results indicated that a larger number of genes than expected may be involved in silk protein regulation in B. mori. Functional analyses of function-unknown transcription factors should enhance our understanding of this system.


Introduction
Silkworms (Bombyx mori) generate silk proteins; they are an economically important insect in sericulture and have proved their value in biotechnology as a bioreactor for the production of recombinant proteins and silk-based biomaterials. Silk proteins can be broadly classified into sericin and fibroin, which are secreted from the middle and posterior with expression patterns that were related to those of target genes were shortlisted as candidates of silk protein regulatory genes. Our results provide insights into how silk protein genes are regulated; moreover, the genes that are discussed herein can be used as targets to improve silk protein production ability.

Constructing a Gene Co-Expression Network and Detecting Modules
We developed a command line tool named networkz to handle large gene co-expression datasets (or gene expression profiles) and to perform co-expression network analysis. networkz was written in C++ and the source code is available at https://github.com/ davecao/networkz.git (accessed on 23 December 2021); it is based on Boost Graph Library v1.70 [https://www.boost.org (accessed on 23 December 2021)] for graph data structure operations and Eigen Library v3. 3.90 [(https://gitlab.com/libeigen/eigen/-/releases (accessed on 23 December 2021)] for matrix operations.
The relationships among genes in the co-expression dataset can be represented by a network, which is an undirected and weighted graph consisting of vertices and edges; herein genes are referred to as vertices while their edges represent the pairwise co-expression measure. To construct an initial co-expression network, we selected a significance measure threshold to determine the connected gene pairs with a significant co-expression relationship, and then modules (or hub genes) that were highly connected with others were detected in the subsequent analysis.
In this study, a gene profile is denoted as a vector with m components; x i = (x i,1 , x i,2 , . . . , x i,m ). Then n gene expression profiles were represented by an n × m matrix; X = (x 1 , x 2 . . . , x n ) T . The expression measure between the genes p and q (d p,q ,) was defined as follows: 2 , i, j = 1, . . . , n, i = j wherein |corr(p, q)| represents the absolute value of Pearson's correlation coefficient between the expression profiles of p and q; x i and x j present mean of x i and x j , respectively. The smaller the value of d p,q is, the higher the likelihood of the two genes (p and q) in the network being interconnected (i.e., showing high correlation in terms of pairwise gene similarity). The threshold of 0.1 was selected via trial and error.
To detect modules in the initially constructed network, we further employed the Kruskal's algorithm [25], as vertices were much more than edges, to find a minimum spanning tree (MST) with minimum sum of edge weights; then, the Louvain method [26] was performed on the MST to assign each gene with a community ID. Finally, modules of interest were found.

Co-Expression Network Analysis
For co-expression network analysis with networkz, we used transcript-level transcripts per million (TPM) values as expression data of two RNA-seq data series, which were used for the assembly and verification of the current reference transcriptome dataset of B. mori in our previous study [24] The first RNA-seq data series (SRA Run ID: DRR068893-068895 and DRR095105-095116) was obtained from the fat body (FB), midgut (MG), MT, whole SG (SG), and TT of the o751 strain last instar larvae on third day (Table 1) [21][22][23]. The second RNAseq data series (SRA Run ID: DRR186474-186503) was obtained from the aforementioned five SG regions (ASG, A-MSG, M-MSG, P-MSG, and PSG), FB, MG, MT, TT, and OV of p50T strain last instar larvae on third day (Table 1) [24]. The transcript-level TPM expression data that were used in this study are available at "expression data of each transcript in multiple tissues" in the study by Yokoi et al. 2021 (doi: 10.18908/lsdba.nbdc02443-002.V001), in which 51,926 transcripts were used as reference sequences for TPM calculation [24].
Herein we used the same transcript ID as that of reference transcript sequences. The silk protein genes (sericin1, sericin2, sericin3, fibroin-H, fibroin-L, and P25) and five existing regulatory genes (SGF1, SGF3, sage, Antp, and Awh) served as target genes. Target network modules containing transcripts (isoforms) of the target genes were identified from network modules that were constructed by networkz. As the target genes showed multiple isoforms, multiple target network modules were identified for each target gene. The transcripts that were annotated with major TF-specific motif in target network modules were screened as candidate TFs.

RNA Extraction
To extract total RNA, fifth instar larva of the w-1 pnd strain of B. mori were kept on an artificial diet (Nihon Nosan Kogyo, Yokohama, Japan) at 25 • C under LD 12:12 h. The SGs of three male and female insects were then extracted every day during the last instar period (day 0-7). Total RNA was isolated from one pair of SGs for each individual using TRIzol (Invitrogen, Carlsbad, CA, USA) and RNeasy Plus Mini Kit (Qiagen, Hilden, Germany), and the wet weight of the whole SG was measured using the other pair of SG.

Gene Expression Analysis
For quantitative RT-PCR (qRT-PCR), cDNAs were synthesized from 500 ng RNA using the Prime Script ® RT reagent kit (Takara, Tokyo, Japan). Elongation factor-2 (EF-2) was used as a reference gene to calculate the relative expression levels [27,28]. Except EF-2, the specific primers were newly designed for each gene using Primer3Plus (Table S1) [29]. The expression levels of each gene were quantified using TB Green™ Premix Ex Taq™ II (Takara, Tokyo, Japan) on a Light Cycler 480 (Roche Diagnostics, Mannheim, Germany). Biological triplicates were subjected to qRT-PCR, and each sample contained cDNA from each tissue of a male and female pair. The relative expression levels of each gene were calculated by adopting the standard curve method. Statistical analysis was performed using ANOVA and the Tukey-Kramer test for comparisons among the last instar period. These statistical analyses were performed using the statistical software Mac Statistical Analysis ver. 2.0.

Co-Expression Network Analysis with Tissue Expression Data
Co-expression network analysis was performed with networkz to detect the candidate genes that regulate silk protein genes or the known regulatory factors of silk proteins. The program (networkz) allocates each transcript to the single most plausible network module. In total, 1022 network modules were generated, and the transcripts of the target genes were identified in 20 network modules ( Table 2, Data S1). Of these, two target genes, P25 and Awh, belonged to the fibroin-L and fibroin-H modules, respectively, whereas four known TFs (SGF1, SGF3, sage, and Antp) were sorted into different modules. Overall, 91 TFs were detected in the above 20 modules. The sericin1 modules, which showed a specific expression pattern in the M-MSG and P-MSG, contained 39 TFs among 565 transcripts. In addition, the sericin2 modules, which showed a specific expression pattern in the A-MSG, contained 11 TFs among 289 transcripts, and the sericin3 module, which showed a specific expression pattern in the whole SG of RNA-seq data-1 and M-MSG, contained two TFs among 36 transcripts ( Figure 1A, Table 2). Although fibroin-H and fibroin-L showed PSGspecific expression patterns, they were separated into different network modules because of differences in TPM values. These modules contained nine TFs among 122 transcripts ( Figure 1A, Table 2). The modules of four known TFs (SGF1, SGF3, sage, and Antp) contained >100 transcripts, including 5-11 TFs ( Figure 1B, Table 2). All the obtained TFs were similarly expressed at one or more tissues with each target gene. Collectively, 91 transcripts were screened as candidate TFs that seem to regulate target gene expression.

Time-Course Expression Analysis of the Four SG Regions during the Last Instar Period
It is notable that the SG developed for seven days, with the wet weight reaching a peak on the fifth day of last instar ( Figure 2A). To narrow down the candidate regulatory genes, we evaluated the time-course expression pattern of TFs that were screened by co-expression network analysis in the four SG regions (A-MSG, M-MSG, P-MSG, and PSG) during the last instar period using qRT-PCR ( Figure 2B-D, Figure S1A, Data S2). The expression levels of 45 TFs, showing high transcript-level TPM values among the sericin1-3, SGF1, SGF3, sage, and Antp modules were quantified in three regions of the MSG. sericin1 was mainly expressed in the M-MSG and its expression level reached a peak on fourth day ( Figures 2E and 3A). Antp was also mainly expressed in the M-MSG, but its expression level reached a peak before that of sericin1 ( Figures 3H and S1B). Similar to Antp, the expression level of five TFs belonging to the sericin1 module (KWMTBOMO00016, KWMTBOMO14062, MSTRG.11166.1, MSTRG.14404.3, and MSTRG.16824.2) including homeobox domain-containing genes ( Table 3) and that of a TF belonging to the Antp module (MSTRG.3176.1) reached a peak before that of sericin1 ( Figure 3A,H). sericin2 was mainly expressed in the A-MSG, and its expression level markedly decreased on the fifth day ( Figures 2E and 3B).  Figure 3B). sericin3 was also mainly expressed in the A-MSG, and its expression level increased over the later period ( Figures 2E and 3C). MSTRG.18671.5 and sericin3 showed similar expression patterns ( Figure 3C). The expression of SGF1 showed a similar pattern among all regions of the MSG, with the expression level decreasing on the fifth day ( Figure S1B). KWMTBOMO08832 belonging to the SGF1 module, showed similar expression pattern to SGF1 ( Figure 3E). SGF3 was primarily expressed in the A-MSG, with its expression peaking on the fifth day ( Figures 3F and S1B). Although the forkhead domain-containing gene KWMTBOMO02915 belonged to the SGF3 module, its expression pattern was similar to that of sericin3 ( Figure 3F, Table 3). sage was also mainly expressed in the A-MSG, and its expression pattern was similar to that of sericin3 ( Figures 3G and S1B). In contrast, KWMT-BOMO12108, belonging to the sage module, showed a high expression level in the earlier period, with its expression level markedly decreasing on the fifth day. This was similar to the pattern that was exhibited by sericin2 ( Figure 3G). Furthermore, the expression levels of nine TFs belonging to the fibroin-H and fibroin-L modules were quantified in the PSG. fibroin-H, fibroin-L, and P25 expression levels were found to be elevated through the last instar period, along with SG development (Figures 2A and 3D). Although both the fibroin modules contained no TFs with expression patterns that were similar to those of fibroin-H and fibroin-L ( Figure 2D), three TFs (MSTRG.11402.4, MSTRG.9312.1, and MSTRG.1346.1) were expressed during the earlier period, in contrast to the pattern that was exhibited by fibroin-H and fibroin-L ( Figure 3D, Table 3). Awh was expressed through the mid-phase of the last instar period ( Figure 2D, Table 3). In total, 17 TFs were eventually detected and found to be related to silk protein genes; they contained not only known regulatory factors such as the Awh isoform PB (MSTRG.1346.1) but also uncharacterized or function-unknown genes ( Table 3). the last instar period ( Figure 2D, Table 3). In total, 17 TFs were eventually detected and found to be related to silk protein genes; they contained not only known regulatory factors such as the Awh isoform PB (MSTRG.1346.1) but also uncharacterized or functionunknown genes (Table 3). Sampling schedule for qRT-PCR during last instar larva (C). The relative expression levels (mean ± SE, biological triplicates) of silk protein genes at each SG region during the last instar period, and a heatmap that was based on the expression of silk protein genes and their module TFs (D). Integrated graphs (mean ± SE, biological triplicates) showing sericin expression at each MSG region (E). Transcript ID is indicated on the right.

Discussion
In previous studies, some genes or gene groups that are specifically expressed in the SG were identified using RNA-seq and microarray [30][31][32]. Despite this, a comprehensive screening strategy is much needed to identify the key factors that regulate silk proteins. Although B. mori has been previously used for co-expression network analysis [19,20], the mechanisms underlying the regulation of silk protein genes remain unclear. Therefore, in this study, we performed co-expression network as well as time-course expression analyses to identify the genes that regulate silk protein genes in B. mori. The co-expression network analysis was performed using an in-house program called networkz; consequently, 20 network modules that were related to 11 target genes were identified. The obtained TFs exhibited tissue expression patterns that were similar to those of each target gene (Figure 1), whereas, the majority of known TFs (SGF1, SGF3, sage, and Antp) formed a module that was distinct from the silk genes, respectively. Although the known TFs are co-expressed with the silk genes in the silk glands, they showed different expression patterns in other tissues, which led to the different modules. The different tissue expression patterns may be due to additional functions of these TFs which are not related with the silk gene regulation in the silk glands. These results indicated that networkz could successfully identify the related transcripts of each target from transcriptome data. sericin1 and sericin2 showed multiple modules as their mRNAs encode multiple isoforms with slightly different expression patterns at the tissue level (Table 2, Figure 1A) [4,[33][34][35]. It, therefore, seems possible that diverse genes regulate sericin1 and sericin2 expression.
Time-course expression analysis led to the identification of 17 TFs that showed specific expression patterns and were related to target genes in the MSG and PSG during the last instar period (Figures 2 and 3). The sericin1 module contained two homeobox domain-containing genes (MSTRG.14404.3 and MSTRG.16824.2), the expression of which appeared before the expression of sericin1 peaked ( Figure 3A). MSTRG.14404.3 possessed a homothorax (Hth)-like motif (Table 3). Hth is a known cofactor of Antp and is thus related to regulating sericin1 expression [4]; therefore, it appears that MSTRG.14404.3 is also involved in sericin1 expression regulation. Although SGF3, as with SGF1, is also involved in regulating sericin1 expression [8][9][10], it is possible that KWMTBOMO02915 ( Figure 2D) regulates sericin3 expression as its expression pattern was similar to that of sericin3 in the A-MSG ( Figure 3C,F). Furthermore, KWMTBOMO02915 was already recognized as MSG-specific expression TF in a previous study [24]. The expression level of the histone superfamily gene KWMTBOMO12108 decreased in the later period of last instar, and it was similar to that of sericin2. It has been reported that 20-hydroxyecdysone (20E) titer increases in the later period of last instar [36], and that 20E treatment has a repressive effect on histone gene expression [37]. Hence, it is possible that KWMTBOMO12108 regulates sericin2 expression via 20E titer transition. The fibroins modules contained MSTRG.11402.4, MSTRG.9312.1, and MSTRG.1346.1, which showed high expression levels during the earlier period, in contrast to the fibroins expression pattern. MSTRG.11402.4, MBF2 partial transcript, is reportedly involved in fibroin-H expression regulation [38] and is also involved in nuclear transport in the SG along with FTZ-F1 [39]. Although Awh isoform PA (KWMTBOMO00651) and Awh isoform PB (MSTRG.1346.1) belonged to the fibroins modules, their expression patterns were different during the last instar period (Figures 2D and 3D). Besides, although the TFs KWMTBOMO02915 and KWMTBOMO12108 showed similar expression patterns with their target genes at the tissue level ( Figure 1A), different expression patterns from their target genes were observed in the time-course expression ( Figure 3F,G). These results suggested that when designing a screening strategy, including both co-expression network and time-course expression analyses is pivotal. As stated earlier, the TFs MSTRG.11402.4, MSTRG.14404.3, MSTRG.1346.1, and KWMTBOMO02915 are known to be related to silk protein genes, while 13 novel function-unknown TFs were recognized as candidates of silk proteins regulation factor. Herein we performed time-course expression analysis to screen related TFs by qRT-PCR focusing on the candidates. Extending this approach to co-expression network analysis using RNA-seq data will help to provide insights into full extent of silk protein genes regulation.

Conclusions
In this study, silk protein regulatory genes in B. mori were identified using a two-step screening strategy. In the first step, 20 network modules including 91 TFs were screened by co-expression network analysis using the in-house program networkz, and in the second step, 17 transcripts were screened as silk protein-related genes by time-course expression analysis of the MSG and PSG during the last instar period. Since four of these TFs were already known to be related with the silk gene, we found 13 TFs as candidates for novel silk regulatory factors. As we identified both known as well as function-unknown TFs, we believe that our strategy is robust and highly sensitive to screen relative genes. Furthermore, screening results indicated that a larger number of genes than expected may be involved in silk protein regulation in B. mori. Functional analyses of function-unknown TFs should further our understanding of the mechanisms underlying silk protein regulation.
Supplementary Materials: The following materials are available online at figshare. Figure S1: The relative expression levels of known regulatory factors at each SG region during the last instar period, and a heatmap that was based on the expression of known regulatory factors and their module TFs. (doi:10.6084/m9.figshare.17141888). Table S1: Primer sets that were used in this paper (doi:10.6084/m9.figshare.17141969). Data S1: Output raw data of co-expression network analysis using networkz (doi:10.6084/m9.figshare.17427359). Data S2: Relative expression levels of all tested transcripts (doi:10.6084/m9.figshare.17427362).