Precise Characterization of Bombyx mori Fibroin Heavy Chain Gene Using Cpf1-Based Enrichment and Oxford Nanopore Technologies

Simple Summary Bombyx mori (B. mori), an important economic insect, is famous for its silk. B. mori silk is mainly composed of silk fibroin coated with sericin. Among them, the silk fibroin heavy chain protein has the highest content and the largest molecular weight, which is encoded by the silk fibroin heavy chain (FibH) gene. At present, apart from the complete sequence of the FibH of the B. mori strain p50T, there are no other reports regarding this protein. This is mainly because the special structure formed by the GC-rich repetitive sequence in FibH hinders the amplification of polymerase and the application of Sanger sequencing. Here, the FibH sequence of Dazao, which has 99.98% similarity to that of p50T, was obtained by means of CEO. As far as we know, this is the first complete FibH sequence of the Chinese B. mori strain. Additionally, the methylated CG sites in the FibH repeat unit were identified. Abstract To study the evolution of gene function and a species, it is essential to characterize the tandem repetitive sequences distributed across the genome. Cas9-based enrichment combined with nanopore sequencing is an important technique for targeting repetitive sequences. Cpf1 has low molecular weight, low off-target efficiency, and the same editing efficiency as Cas9. There are numerous studies on enrichment sequencing using Cas9 combined with nanopore, while there are only a few studies on the enrichment sequencing of long and highly repetitive genes using Cpf1. We developed Cpf1-based enrichment combined with ONT sequencing (CEO) to characterize the B. mori FibH gene, which is composed of many repeat units with a long and GC-rich sequence up to 17 kb and is not easily amplified by means of a polymerase chain reaction (PCR). CEO has four steps: the dephosphorylation of genomic DNA, the Cpf1 targeted cleavage of FibH, adapter ligation, and ONT sequencing. Using CEO, we determined the fine structure of B. mori FibH, which is 16,845 bp long and includes 12 repetitive domains separated by amorphous regions. Except for the difference of three bases in the intron from the reference gene, the other sequences are identical. Surprisingly, many methylated CG sites were found and distributed unevenly on the FibH repeat unit. The CEO we established is an available means to depict highly repetitive genes, but also a supplement to the enrichment method based on Cas9.


Introduction
Tandem repeat is a common repetitive DNA sequence in eukaryotes, which are arranged consecutively in the genome, and has been called junk or even selfish DNA because most of these DNA sequences cannot encode proteins [1,2]. In fact, these "junk" DNAs are

Genomic DNA Extraction
To reduce the cost and to increase the yield of gDNA extraction, we adopted the classical phenol extraction method. Briefly, a B. mori pupa ground in liquid nitrogen was incubated in an extraction buffer (1 M Tris-HCl pH 8.0, 0.5 M EDTA pH 8.0, 0.5% SDS) with RNase A at 37 • C for 1 h, and it was then digested with proteinase K (100 µg/mL) at 55 • C overnight to completely degrade the proteins. Next, it was extracted twice with saturated phenol, once with Tris saturated phenol-chloroform-isoamyl alcohol (25:24:1), and once with chloroform. Finally, ethanol was added to the extracted supernatant to precipitate the gDNA. Carotenoids are often precipitated with B. mori pupal gDNA. Therefore, to obtain high purity gDNA, we repeated the extraction process again and extracted the gDNA three times with phenol and chloroform. The DNA pellet was resuspended overnight with EB (1 M Tris-HCl pH 8.0, 0.5 M EDTA pH 8.0) at 4 • C, and it was then centrifuged at 10,080 g for 10 min at 4 • C to obtain the supernatant and a viscous lower suspension. The former was transparent, while the latter was turbid and viscous. The DNA concentration in the supernatant and the DNA suspension was determined using Nanodrop and Qubit. The supernatant DNA was used for the subsequent experiments.

PCR
The forward and reverse primers (Table S1) were synthesized by Tsingke (Beijing, China) and were dissolved in deionized water to amplify the upstream and downstream regions of the 4 crRNA targeting sites (Supplementary Notes S1-S4). PCR was performed using 160 ng B. mori pupal gDNA as DNA templates in a 40 µL reaction mixture containing 20 µL PrimeSTAR Max Premix (2×) (TAKARA, Otsu, Japan, Cat. R045A) and 1.2 µL 10 uM forward and reverse primers. The thermal cycle program was as follows: 98 • C for 4 min, 30 cycles of 98 • C for 10 s, 55 • C for 15 s, 72 • C for 30 s, and 5 min of final extension at 72 • C, using a S1000 Thermal Cycler (Bio-Rad, Hercules, CA, USA). Each PCR product was mixed with 8 µL 5× DNA Loading Buffer with GelRed (Biomed, Beijing, China, EL107-01), separated by 1% agarose gel via electrophoresis, and purified using a gel extraction kit (OMEGA, Biel, Switzerland, Cat. D2500-02) according to the manufacturer's manual. The purified DNA was immediately cleaved by Cpf1/crRNA or stored at −20 • C.
To verify the availability of the four Cpf1/crRNA complexes, the purified DNA was cleaved by the Cpf1/crRNA complex for 1 h at 37 • C and then separated by 1% agarose gel via electrophoresis. The cleavage reaction was composed of 1 µL Cpf1/crRNA complex, 1 µL purified DNA, 1 µL 10× cutsmart buffer, and 7 µL DNase/RNase-free water. The DNA bands were analyzed by ImageJ (Available online: https://imagej.nih.gov/ij/ (accessed on 1 January 2020)) to estimate the cleavage efficiency of the four Cpf1/crRNA complexes.

Data Analysis
Based on the silkworm reference genome (Available online: http://silkbase.ab.a.utokyo.ac.jp/cgi-bin/index.cgi (accessed on 5 November 2019)) [48], the crRNA sequences were designed using the online version of CCTop (Available online: http://crispr.cos.uniheidelberg.de/ (accessed on 1 December 2019)) [49]. In addition, the selected crRNAs were mapped against the silkworm reference genome with bowtie2 (v2.3.5) [50] to check the degree of sequence specificity. Nanopore raw FAST5 reads were base called used Guppy (Available online: https://pypi.org/project/ont-pyguppy-client-lib/ (accessed on 2 March 2020)) to obtain the original data. The reads with a quality value less than 7 were filtered out, and high-quality data (quality value greater than or equal to 7) were used for subsequent analysis. The ONT Reads were aligned to the silkworm reference genome using Minimap2 (v 2.17-r941) [51], and then samtools (v1.9) [52] was used to extract the reads that were aligned to the target region and the number of reads was counted. Canu (v1.7) [53] was used to assemble the reads that mapped to the target area. After assembly, Medaka software (Available online: https://github.com/nanoporetech/medaka (accessed on 23 March 2020)) was used to align the ONT reads to the assembled sequence for error correction on the assembled sequence to obtain a consensus sequence. Based on the reference genome sequence, nanopolish software (Available online: https://github.com/shiliyan/nanopolish (accessed on 15 April 2020)) was used to detect the methylation of the target region.

Cpf1-Based Enrichment and ONT Sequencing (CEO) Overview
To precisely characterize large, highly repetitive, and less complex genes, we established the CEO strategy ( Figure 1). The strategy was divided into four steps: dephosphorylation, Cpf1 digestion, adapter ligation, and sequencing analysis. Briefly, B. mori gDNA was dephosphorylated with CIAP to block the 5 end of all the linearized DNA. The dephosphorylated gDNA was then digested with the Cpf1/crRNA complex to release the target region. Then, the sticky ends left by Cpf1 cleavage were filled by a poly-A tail, and the adaptors were ligated. Finally, the complete sequence of the target gene was obtained through ONT sequencing and bioinformatic analysis.

Preparation of High-Quality B. mori gDNA and High Activity crRNA
ONT sequencing is known for its long read-length, which is determined by the integrity of the gDNA. To obtain high-quality gDNA, it was extracted using phenolchloroform from a B. mori pupa. Because carotenoids are easily extracted from B. mori pupae along with gDNA and affect subsequent experiments, they were therefore removed by repeating the extraction process. Suspension and supernatant DNA were obtained by high-speed centrifugation at 4 • C, and the supernatant DNA was used for subsequent enrichment. The DNA fragments were 23-200 kb (Figure 2A), which spanned the entire FibH (KWMTBOMO15365) and met the needs for ONT library construction. tivity is affected by the DNA environment and crRNA sequence [56]. To obtain highly active crRNA, four crRNAs, FibH-up 1, FibH-up 2, FibH-down 1, and FibH-down 2, which target the upstream and downstream of FibH ( Figure 2B), were designed, and their efficiency score was predicted using CHOPCHOP [57]( Table 1). The residual DNA after the digestion of the Cpf1/crRNA complex was used to determine the cleavage efficiency of the crRNA. Only FibH-down 2 cleaved the purified PCR product incompletely ( Figure  2C). Using ImageJ to perform a grayscale analysis of the DNA bands, the cleavage efficiency of FibH-down 2 was about 98%, while the efficiency of the other three was about 100%.

Figure 1.
Overview of Cpf1-based enrichment and ONT sequencing (CEO) strategy. CEO mainly includes the dephosphorylation of gDNA, Cpf1 cleavage, adapter ligation, and ONT sequencing. CIAP was used to dephosphorylate the B. mori genomic DNA (gDNA), and the target site was released by cleavage with Cpf1/crRNA RNP. Next, the sticky ends were filled in, and the adapters were ligated, and finally, ONT sequencing was performed using PromethION.  Overview of Cpf1-based enrichment and ONT sequencing (CEO) strategy. CEO mainly includes the dephosphorylation of gDNA, Cpf1 cleavage, adapter ligation, and ONT sequencing. CIAP was used to dephosphorylate the B. mori genomic DNA (gDNA), and the target site was released by cleavage with Cpf1/crRNA RNP. Next, the sticky ends were filled in, and the adapters were ligated, and finally, ONT sequencing was performed using PromethION. which target the upstream and downstream of FibH ( Figure 2B), were desi efficiency score was predicted using CHOPCHOP [57]( Table 1). The resid the digestion of the Cpf1/crRNA complex was used to determine the clea of the crRNA. Only FibH-down 2 cleaved the purified PCR product incom 2C). Using ImageJ to perform a grayscale analysis of the DNA bands, th ciency of FibH-down 2 was about 98%, while the efficiency of the other t 100%.   The four crRNA-target sequences were obtained by PCR, and the PCR product was cleaved in vitro with the Cpf1/crRNA complex. "+" and "−" indicate the addition and absence of crRNA in the Cpf1 cleavage reaction, respectively. M3: D2000 Marker.
Cpf1 is not only an efficient genome editing tool, but it is also an important method for DNA assembly and cloning [31,34,54,55]. Whether in vivo or in vitro, its cleavage activity is affected by the DNA environment and crRNA sequence [56]. To obtain highly active Insects 2021, 12, 832 6 of 12 crRNA, four crRNAs, FibH-up 1, FibH-up 2, FibH-down 1, and FibH-down 2, which target the upstream and downstream of FibH ( Figure 2B), were designed, and their efficiency score was predicted using CHOPCHOP [57] (Table 1). The residual DNA after the digestion of the Cpf1/crRNA complex was used to determine the cleavage efficiency of the crRNA. Only FibH-down 2 cleaved the purified PCR product incompletely ( Figure 2C). Using ImageJ to perform a grayscale analysis of the DNA bands, the cleavage efficiency of FibH-down 2 was about 98%, while the efficiency of the other three was about 100%. Table 1. crRNA activity predicted by CHOPCHOP. Efficiency is the abbreviation of the efficiency score, which is obtained by position-specific scoring matrix or support vector machine based on the current literature. MM0, MM1, MM2, and MM3 represent the number of potential off-targets with 0, 1, 2, and 3 mismatches, respectively.

CEO Could Effectively Enrich the Sequence of Interest
Using CEO, 3,620,449 reads were obtained. There were 349 reads that were mapped to Bomo_Chr25: 10,354,516-10,372,477 target regions, of which 38 reads had a length of 16,000 bp or more ( Figure 3A), and some of the reads spanned the reference gene ( Figure 3B). ONT sequencing produced 18.21 Gbp of DNA sequencing data, the average depth of the genome was 38×, and the average depth of FibH was approximately 87×, with an enrichment fold of 2.29. Previous studies found that the DNA at both ends of the breakpoint could be ligated with an adapter after Cas9 cut the genome. Based on this, Haasteren et al. established the amplification-free integration site sequencing method and applied it to detect the lentiviral vector integration sites in the genome [58]. Similarly, CEO also enriched the upstream (between FibH-up 1 and FibH-up 2) and downstream (between FibH-down 1 and FibH-down 2) region of FibH, and the enrichment folds were 2.11 and 1.58, respectively (Table 2). These indicated that CEO could simultaneously perform enrichment analysis on the region of interest in the genome and its upstream and downstream sequences.

CEO Could Characterize the Fine Structure of FibH
ONT reads are long but have random errors. These errors can be corrected by increasing the sequencing depth and deep learning. To verify whether CEO can obtain a high-quality FibH consensus sequence (Supplementary Note S5), we assembled the reads aligned to the target region using Canu, and then aligned the ONT reads to the assembled sequence with Medaka to obtain a consensus sequence of 17,988 bp. So far, two FibH sequences have been published: one (KWMTBOMO15365) from SilkBase (Available online: http://silkbase.ab.a.u-tokyo.ac.jp (accessed on 2 June 2020)) and the other (AF226688.1) from NCBI. The annotated KWMTBOMO15365 has 4 introns, which is different from AF226688.1. We analyzed all introns and found that, with the exception of intron 1 (971bp) of KWMTBOMO15365, the sequence or partial sequence of the other three introns were identical to the repetitive unit, which implied that KWMTBOMO15365 was probably incorrectly annotated. Therefore, we defined that the reference gene is encoded by two exons (Figure 4). Compared to KWMTBOMO15365, the consensus sequence was identical to the reference sequence, except for 3-base deletion in the intron, with 99.98% identity compared to AF226688.1, with 22 SNPs and 11 gaps/insertions in the consensus sequence, which had 99.17% identity. The difference between KWMTBOMO15365 and AF226688.1 was probably caused by low-quality assembled AF226688.1, but this did not rule out the structural variation in the FibH gene between individual silkworms. These results indicated that there were structural variations in the genomes of the Japanese B. mori strain p50T and the Chinese strain Dazao and also proved that CEO could be used for structural studies of large genes with high repetition and low complexity. PEER REVIEW 7 of 12 AF226688.1. We analyzed all introns and found that, with the exception of intron 1 (971bp) of KWMTBOMO15365, the sequence or partial sequence of the other three introns were identical to the repetitive unit, which implied that KWMTBOMO15365 was probably incorrectly annotated. Therefore, we defined that the reference gene is encoded by two exons ( Figure 4). Compared to KWMTBOMO15365, the consensus sequence was identical to the reference sequence, except for 3-base deletion in the intron, with 99.98% identity compared to AF226688.1, with 22 SNPs and 11 gaps/insertions in the consensus sequence, which had 99.17% identity. The difference between KWMTBOMO15365 and AF226688.1 was probably caused by low-quality assembled AF226688.1, but this did not rule out the structural variation in the FibH gene between individual silkworms. These results indicated that there were structural variations in the genomes of the Japanese B. mori strain p50T and the Chinese strain Dazao and also proved that CEO could be used for structural studies of large genes with high repetition and low complexity.

CEO Could Identify Methylation of FibH
As early as 2010, the methylome of the B. mori suggested that there are a large number of methylation sites in the B. mori genome [59]. Since FibH is long and contains many repeat units, the classic bisulfite method cannot detect methylation, and hence, there is no report on FibH methylation in B. mori. ONT not only has a long read length, but can also recognize modified bases, which is widely used to study base modifications in the genome [22,[60][61][62][63].
To identify the presence of base modifications on FibH, we performed a methylation analysis on the sequencing data obtained by CEO using Nanopolish. There was a total of 506 CG sites in FibH, of which 306 were methylated. Most of these methylated CG sites were located on the repeat units (motifs). The methylation frequency of each motif and their methylation frequency at different positions on the FibH gene were counted (Table S2). Among them, motif-1 had the most repetitions, as many as 145, and the total methylation frequency was 2.121, followed by motif-2 and motif-3 with 92 and 65 times, respectively, and their methylation frequencies were 1.782 and 3.251, respectively. Surprisingly, although motif-10 only has three repetitions, its methylation frequency was higher than that of motif-4, and its methylation frequency at Bomo_Chr25: 10,360,481 was as high as 0.206, which was higher than all the other single sites ( Figure 5, Table 3). These results indicate that there were abundant methylated cytosines on the repeat units of FibH and that the number of repeat units was not correlated with the methylation frequency.

CEO Could Identify Methylation of FibH
As early as 2010, the methylome of the B. mori suggested that there are a large number of methylation sites in the B. mori genome [59]. Since FibH is long and contains many repeat units, the classic bisulfite method cannot detect methylation, and hence, there is no report on FibH methylation in B. mori. ONT not only has a long read length, but can also recognize modified bases, which is widely used to study base modifications in the genome [22,[60][61][62][63]. To identify the presence of base modifications on FibH, we performed a methylation analysis on the sequencing data obtained by CEO using Nanopolish. There was a total of 506 CG sites in FibH, of which 306 were methylated. Most of these methylated CG sites were located on the repeat units (motifs). The methylation frequency of each motif and their methylation frequency at different positions on the FibH gene were counted (Table S2). Among them, motif-1 had the most repetitions, as many as 145, and the total methylation frequency was 2.121, followed by motif-2 and motif-3 with 92 and 65 times, respectively, and their methylation frequencies were 1.782 and 3.251, respectively. Surprisingly, although motif-10 only has three repetitions, its methylation frequency was higher than that of motif-4, and its methylation frequency at Bomo_Chr25: 10,360,481 was as high as 0.206, which was higher than all the other single sites ( Figure 5, Table 3). These results indicate that there were abundant methylated cytosines on the repeat units of FibH and that the number of repeat units was not correlated with the methylation frequency.

Discussion
We established an enrichment sequencing method based on Cpf1 combined with ONT for genes with low complexity and high repetition. Using CEO, we described the fine structure and methylation modification of the highly repetitive FibH gene. The FibH

Discussion
We established an enrichment sequencing method based on Cpf1 combined with ONT for genes with low complexity and high repetition. Using CEO, we described the fine structure and methylation modification of the highly repetitive FibH gene. The FibH of Dazao strain had a 99.98% similarity to the reference gene. With the exceptions of the three base deletions in the intron, their exons were identical. This may be due to differences in the strains. This indicates that FibH in different B. mori strains or individuals show polymorphism. Meanwhile, the methylation modification on the repetitive sequence of FibH was discovered by CEO for the first time. However, when methylation occurs and its effect on the expression of FibH are still unknown, which will become the focus of our research in the future.
As we all know, the CRISPR/Cas system is an acquired immune response used by bacteria and archaea to prevent the invasion of foreign DNA, such as plasmids and phages [25]. Among the artificially synthesized CRISPR/Cas systems, the Class 2 system represented by Cpf1 and Cas9 is the most widely used. In addition to our CEO method, other Cas9 nanopore sequencing strategies have also been established, including Negative Enrichment [47], CaBagE [44], FUDGE [64], nCATS [65], and AFIS-Seq [58]. The first two methods are based on the fact that the Cas9 rests on the DNA strand for a short time after cutting the DNA, whereas the principles of the latter three methods are similar to those of our CEO method. They all use the CIAP to seal the DNA ends before cutting the gDNA with the Cas protein. The target site sequences enriched by these methods ranged from 1 kb to 100 kb, including sequence replication genes, but their enrichment efficiency varies. The side-by-side experiments comparing target enrichment with nCATS and CaBagE showed that their average depths ranged from 93× to 322× and 30× to 53×, respectively, and the former was 2.6 to 10.7-fold higher than the latter [44]. The median coverage of the FibH gene was 87× via CEO. This difference in coverage is attributed to the sequencer. CaBagE and nCATS use MinION, while the CEO uses PromethIon, which has a higher sequencing throughput. In contrast, nCATS has higher enrichment efficiency than other methods. Among them, the enrichment efficiency of Negative Enrichment and CaBagE is mainly affected by the activity of the crRNA and the degree of cleavage of the unprotected DNA by exonuclease, and the main factor that determines the enrichment efficiency of CEO, FUDGE, nCATS, and AFIS-Seq is crRNA (gRNA) activity and the degree of dephosphorylation of gDNA. It is necessary to compare the enrichment efficiency of these methods under the same conditions to study genes.
In addition to silk protein genes, genes with tandem repeats also include many diseaserelated genes, such as SAMD12 [66], BRCA1 [67], and VHL [41], which are caused by a duplication of repeats in these genes [9,68]. Understanding the sequence variation of these genes could provide clues for clinical treatment. We believe that our CEO will become an important means of disease-related gene research.