Cell lines derived from Trichoplusia ni, the cabbage looper moth, have been used for many years to produce recombinant proteins by means of the baculovirus expression vector system (BEVS). While cell lines from other lepidopteran hosts such as Spodoptera frugiperda have commonly been used for production of baculoviruses, Trichoplusia cell lines have been shown in several cases to outperform these cell lines in production yield and protein quality, particularly with regard to secreted proteins [1]. In the past decade, insect cell protein production has emerged as a viable alternative to bacterial and mammalian cells for the production of therapeutically relevant proteins, with several vaccine products generated in baculovirus-infected insect cells having been approved by regulatory agencies [2]. Therefore, a comprehensive systems biology approach to improving protein production in these cell lines would significantly benefit their utility as protein production hosts. However, to date, only a few complete genome sequences of lepidopteran hosts are available. One of them, that of the silkworm Bombyx mori, has been published [4], while an incomplete draft genome of Spodoptera frugiperda (the host from which the Sf9 and Sf21 lines were derived) is the only sequence available for the more commonly used protein production hosts [5]. The transcriptome [6] of the Trichoplusia ni-derived cell line Tnms42 and RNA-seq data [7] from the High Five cell line (BTI-Tn-5B1-4) have been published, but these data are not useful for large-scale genome engineering because they lack non-coding genomic DNA information. In addition, transcriptome data are inherently biased towards genes with high transcription levels and likely lack coverage of significant regions of the coding genome.
Tni-FNL is a cell line derived by adaptation of BTI-Tn-5B1-4 cells, originally isolated from Trichoplusia ni egg cells in the Wood laboratory at Cornell [8]. While the original cell line was an adherent cell line that grew in the presence of serum, Tni-FNL was selected both for suspension growth, to optimize its utility for protein production, and for the ability to grow in the absence of serum. The Tni-FNL cell line has been shown to routinely produce higher levels of protein than Sf9 or Sf21 cells, and in some cases to surpass the levels produced in the more commonly used and commercially available Trichoplusia ni cell line, High Five. High Five cells were derived from the same parent line as Tni-FNL, suggesting that the specific adaptation process likely effected changes in the cell line that resulted in this improvement in protein production. For these reasons, we decided to determine the complete genome sequence of the Tni-FNL cell line. This sequence will benefit systems biology approaches to creating improved cell lines that can support higher levels of protein expression and potentially improve the quality and lower the cost of therapeutic protein production.
Next-generation sequencing technologies have long been used to assemble many animal and plant genomes. However, the short reads they produce have difficulty spanning the repetitive regions commonly found in many genomes, and therefore yield draft genomes containing many gaps, with potential mis-assemblies and collapsed contigs. Recent advances in sequencing technologies, especially in single-molecule sequencing [9], have made it possible to sequence reads that are longer than most of the common repeats in both microbial and vertebrate genomes, leading to highly contiguous assemblies. Combining PacBio single-molecule sequencing [9] with complementary technologies such as Illumina short reads, Bionano optical mapping [10], and 10X Genomics (Pleasanton, CA, USA) linked reads [11] has become the recommended strategy for optimal genome assembly [12]. Here we report that, by applying these technologies and assembly strategies, we have generated the first draft genome assembly of the Tni-FNL cell line, which was derived from Trichoplusia ni cells. Comparative analysis of our Tni-FNL (Trichoplusia ni) draft genome against closely related species, as well as the recently published Hi5 germ cell genome assembly [13], provided further evidence of the high accuracy and completeness of our Tni-FNL cell line genome assembly.
2. Materials and Methods
2.1. Cell Culture Conditions
Tni-FNL cells were cultured under shaking conditions (125 rpm, 2-inch throw) in 500 mL shaker flasks at 27 °C in Gibco Sf-900 III SFM media (Thermo Fisher Scientific, Waltham, MA, USA).
2.2. PacBio Library Preparation and Sequencing
High-molecular-weight genomic DNA (20–150 kb) was extracted from the cultured Tni-FNL cell line using the Genomic-tip 20/G kit (Qiagen, Hilden, Germany). For PacBio library preparation, approximately 15 µg of genomic DNA were sheared to an average size of 20 kb using a g-TUBE™ (Covaris®, Woburn, MA, USA). All sizing and quantitation measurements were performed using the genomic kit for the TapeStation 2200 (Agilent Technologies, Santa Clara, CA, USA). Purity was assessed by calculating the ratio of absorbance at 260 nm to absorbance at 280 nm as measured on a NanoDrop™ spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and was determined to be suitable. Following PacBio’s standard 20 kb library preparation protocol 100-286-000-05, the final library was size selected using a dye-free 0.75% agarose cassette on a BluePippin (Sage Science, Beverly, MA, USA) with a lower cutoff of 10 kb. 16 SMRT® cells were sequenced on the PacBio RS II (Pacific Biosciences, Menlo Park, CA, USA) using P6/C4 chemistry, 0.15 nM MagBead loading concentration, and 360 min movie lengths. Additionally, 5 µg of genomic DNA from the same sample were sheared to an average size of 20 kb using a g-TUBE (Covaris, Woburn, MA, USA), which was used as input to create a library using the Accel-NGS® XL Library Kit for Pacific Biosciences® (Swift Biosciences™, Ann Arbor, MI, USA). The final library was size selected using a dye-free 0.75% agarose cassette on a BluePippin (Sage Science, Beverly, MA, USA) with a lower cutoff of 15 kb. 2 SMRT® cells were sequenced on the PacBio RS II (Pacific Biosciences, Menlo Park, CA, USA) using P6/C4 chemistry, 0.15 nM MagBead loading concentration, and 360 min movie lengths. Additionally, approximately 15 µg of genomic DNA were sheared to an average size of 20 kb and were prepared for sequencing on the PacBio Sequel System, using a size selection with a 15 kb lower cutoff on the BluePippin. 
Two Sequel 1M SMRT® Cells were sequenced using Sequel Polymerase 2.0 and Sequel Sequencing Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA).
2.3. Bionano Optical Mapping
Optical mapping was performed using the Irys optical mapping technology from Bionano Genomics (San Diego, CA, USA). The sample was prepared as per the IrysPrep Plug Lysis protocol 30026 Rev D and Labeling-NLRS protocol 30024 Rev J. Two million cells from the Tni-FNL cell line were embedded in an agarose plug for extraction of ultra-high-molecular-weight genomic DNA (100–2000 kb). Briefly, the cells were washed with phosphate buffered saline (PBS), and the cell suspension was mixed thoroughly with 2% agarose and then set into cold plug molds for 15 min. Plugs were treated overnight with Proteinase K at 50 °C, followed by RNase A digestion at 37 °C for 1 h. After washing the plugs with wash buffer and TE, DNA was recovered by incubating the molten plug with Agarase for 45 min at 43 °C. The DNA was further cleaned by drop dialysis using a 0.1 µm dialysis membrane set on top of TE in a petri dish. The DNA was dispensed on top of the membrane and dialyzed for 45 min. Homogenization of the DNA was achieved by overnight incubation at room temperature. A total of 600 ng of purified high-molecular-weight DNA was nicked using 80 units of the nicking endonuclease Nb.BssSI (New England Biolabs, Ipswich, MA, USA) for 2 h at 37 °C. Fluorescently tagged nucleotides were then incorporated at the nicked sites by Taq DNA polymerase during the labeling reaction at 72 °C for 60 min. This was followed by repair in the presence of polymerase and Taq DNA ligase for 30 min at 37 °C. After counterstaining the DNA backbone with the YOYO-1 dye, the final sample was quantitated again and 9 µL were loaded into each flowcell of an IrysChip. The labeled DNA molecules were linearized in the nanochannels on the chip and imaged by the Irys instrument (Bionano Genomics, San Diego, CA, USA). Both flowcells were run per the Modified Base Recipe for 30 cycles, with a DNA concentration time of 200 s.
After the first run, pillar cleaning of the chip was performed, and the chip was imaged again for an additional 30 cycles. This over-cycling was performed three additional times for both flowcells of the IrysChip (30 cycles in each run) to acquire additional data.
2.4. 10X Genomics Linked Reads Sequencing
High-molecular weight DNA from the Hi5 cells (extracted using the Bionano Plug Lysis protocol) was also used to make 10X libraries, as per the Chromium Genome library preparation protocol from 10X Genomics (Pleasanton, CA, USA). In brief, 0.9 ng/µL DNA were used for GEM generation in the Chromium Controller machine (10X Genomics, Pleasanton, CA, USA). The long DNA molecules were partitioned along with oligo-coated Gel Beads that provide a 16 bp 10X barcode, an Illumina R1 sequence, and a 6 bp random primer sequence. Isothermal incubation of the GEMs at 30 °C for 3 h, followed by 65 °C for 10 min produced barcoded fragments. These fragments were recovered from the GEMs and cleaned up for subsequent library preparation steps that included end repair, A-tailing and adapter ligation per the manufacturer’s recommendations. Eight cycles of amplification during the sample index PCR provided enough yield of the indexed library. The library was quantitated by qPCR and sequenced on NextSeq (High output kit) (Illumina, San Diego, CA, USA) with 2 × 150 paired-end reads.
2.5. Transcriptome Sequencing of Tni-FNL Cell Line
Total RNA was extracted from Tni-FNL cells using the NEB Monarch Total RNA Miniprep kit (New England Biolabs, Ipswich, MA, USA) as per the manufacturer’s instructions. Briefly, a frozen pellet of approximately 20 million cells was thawed and resuspended in RNA lysis buffer. Genomic DNA was removed by binding to a gDNA removal column, followed by purification of RNA by binding to the RNA purification column. On-column DNase I treatment was performed for removal of residual gDNA. RNA was eluted in nuclease-free water and RNA integrity was assessed using the RNA 6000 Nano Kit on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). Approximately 20 µg RNA was obtained and aliquoted before storage and further use.
A short-read sequencing library was prepared from the Tni-FNL total RNA using the NEBNext Ultra II Directional RNA Library Prep kit (New England Biolabs, Ipswich, MA, USA) as per the manufacturer’s instructions. Briefly, 1 µg and 500 ng of total RNA were subjected to rRNA depletion by rRNA probe hybridization and RNase H digestion. The excess probes were removed by DNase I digestion and the RNA was purified using RNAClean XP beads (Beckman Coulter, Brea, CA, USA). RNA was fragmented at 94 °C for 5 min to generate a large-insert library. Accordingly, a longer incubation time (50 min at 42 °C) was used during first-strand cDNA synthesis. Purified double-stranded cDNA was subjected to end prep and adaptor ligation as per the protocol. After purification and selection for larger fragments, the adaptor-ligated DNA was enriched using 9 cycles of PCR amplification and purified using SPRIselect beads (Beckman Coulter, Brea, CA, USA). Library quality assessment on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) showed an average library size of around 430 bp.
Total RNA libraries generated from 1 µg and 500 ng Tni-FNL RNA were pooled and quantified by qPCR. Paired-end sequencing (150 bp reads) was performed on the Illumina (San Diego, CA, USA) MiSeq platform using v2 sequencing chemistry.
2.6. Propidium Iodide (PI) Staining for Ploidy Determination
Tni-FNL and Sf9 cells in exponential growth phase were harvested at a cell concentration of approximately 1 × 10⁶ cells/mL and centrifuged at 1050 rpm for 5 min, after which the supernatant was discarded. The cell pellet was suspended in 20 mL of cold (−20 °C) 70% ethanol for fixation and the samples were stored at −20 °C. On the day of flow cytometry analysis, cell samples were centrifuged at 1050 rpm for 5 min to remove the ethanol fixative. The supernatant was discarded and the cells were washed two times with 10 mL phosphate buffered saline (PBS). The sample was centrifuged again at 1050 rpm for 5 min, the supernatant was discarded, and the pellet was suspended in 1 mL of RNase solution (250 μg/mL; Sigma, St. Louis, MO, USA) for 20 min at 37 °C. A 50 μL aliquot of propidium iodide (PI; 50 μg/mL) was added to each sample, mixed, and incubated at room temperature for 5 min before analysis by flow cytometry.
2.7. Flow Cytometry Analysis
Flow cytometric analysis was performed using the LSRFortessa (BD Biosciences, San Jose, CA, USA) SSC-A vs FSC-A with a gate for cell population. Single cells were selected for analysis by using the distribution of propidium iodide-W against propidium iodide-A to discriminate doublets and debris. The propidium iodide-A voltage was adjusted to set the mean of the singlet peak of the Sf9 cell (reference cell) G0/G1 population at 50,000 in the histogram. The data were collected using FACSDiva software version 8.0 (BD Biosciences, San Jose, CA, USA) and analyzed by FlowJo software version 10.2 (FlowJo, Ashland, OR, USA). The DNA index was calculated as the ratio of the mean fluorescence intensity (MFI) of the Tni-FNL cell G0/G1 population to the MFI of the normal reference (Sf9) G0/G1 population. Ploidy of the test sample was then calculated based on the DNA index and the ploidy of the normal reference.
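The DNA index and ploidy calculation described above can be illustrated with a minimal Python sketch. It assumes simple proportional scaling (ploidy equals DNA index times reference ploidy), which ignores any difference in haploid genome size between the sample and the reference. The function names and all numeric inputs other than the 50,000 reference setting are hypothetical and are not values from this study.

```python
def dna_index(sample_mfi: float, reference_mfi: float) -> float:
    """Ratio of the sample G0/G1 mean fluorescence intensity (MFI)
    to the reference G0/G1 MFI."""
    if sample_mfi <= 0 or reference_mfi <= 0:
        raise ValueError("MFI values must be positive")
    return sample_mfi / reference_mfi


def estimate_ploidy(sample_mfi: float, reference_mfi: float,
                    reference_ploidy: float) -> float:
    """Assumption: ploidy scales linearly with the DNA index."""
    return dna_index(sample_mfi, reference_mfi) * reference_ploidy


# Reference G0/G1 peak set to 50,000 as in the text; the sample MFI
# and the reference ploidy below are hypothetical placeholders.
index = dna_index(sample_mfi=100_000, reference_mfi=50_000)
ploidy = estimate_ploidy(100_000, 50_000, reference_ploidy=2)
print(f"DNA index = {index:.2f}, estimated ploidy = {ploidy:.1f}")
```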
2.8. Karyotype Analysis
Chromosome preparations were obtained from established cultures of Tni-FNL. Vinblastine (5 mg/mL; Sigma, St. Louis, MO, USA) was added to the cells, which were incubated at 27 °C for 2 h prior to harvest. Cells were treated with hypotonic solution (0.075 M KCl) for 20 min at 37 °C and fixed with methanol:acetic acid (3:1). Slides were prepared at 60% humidity and aged overnight. Pairing was completed using slides stained with a trypsin-Giemsa staining technique (GTG). Analyses were performed under an Axio Imager Z2 microscope (Zeiss, Oberkochen, Germany) coupled with a VDS CCD-1300 camera (Genasis, ASI, Carlsbad, CA, USA); images were captured with Spectral Acquisition Band View 7.2 karyotyping software (Applied Spectral Imaging Inc., Carlsbad, CA, USA).
2.9. NGS De Novo Assembly Methods
The genomic libraries were sequenced on two different sequencing platforms: PacBio (the Sequel and RS II systems; Pacific Biosciences, Menlo Park, CA, USA) and the Illumina NextSeq 500 (Illumina, San Diego, CA, USA). We performed de novo assembly of the PacBio sequencing reads using the HGAP4 assembler [14] from SMRT Link software version 4.0.0 (Pacific Biosciences, Menlo Park, CA, USA). The HGAP4 assembly consensus was polished using the Quiver software in the SMRT Link software package. In addition, the Canu v1.4 assembler [15] was used to generate a second primary assembly. Canu was run with all three of its stages: error correction, trimming, and assembly. 10X Genomics Supernova v1.2.0 (10X Genomics, Pleasanton, CA, USA) was run iteratively on subsampled read sets to find the coverage that gave the best assembly results. We subsampled the linked-read data to 42× coverage and performed de novo assembly.
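As a rough illustration of what subsampling to 42× coverage implies, the sketch below estimates how many 2 × 150 bp read pairs are needed for a given coverage. It assumes a genome size of 359 Mb (the assembly size reported in the Discussion, which is not necessarily the estimate used when subsampling) and counts every sequenced base, although in linked-read data part of read 1 is taken up by the barcode. The function name is hypothetical.

```python
def read_pairs_for_coverage(target_coverage: int, genome_size_bp: int,
                            read_length_bp: int = 150) -> int:
    """Number of read pairs needed to reach the target coverage, rounded up."""
    if min(target_coverage, genome_size_bp, read_length_bp) <= 0:
        raise ValueError("all inputs must be positive")
    bases_needed = target_coverage * genome_size_bp
    bases_per_pair = 2 * read_length_bp
    # Ceiling division on integers avoids floating-point rounding.
    return -(-bases_needed // bases_per_pair)


# Assumed genome size of 359 Mb; 2 x 150 bp reads as described in Section 2.4.
pairs = read_pairs_for_coverage(42, 359_000_000)
print(f"about {pairs:,} read pairs for 42x coverage")
```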
2.10. Bionano De Novo Assembly
De novo assembly was done using the Bionano Genomics RefAligner version 5122 software (Bionano Genomics, San Diego, CA, USA). First, we merged all the Bionano runs using the merge function of IrysView. The merged molecule set was then used for Bionano de novo assembly. The Hierarchical Genome Assembly Process (HGAP4) assembly, converted to a Consensus Map (CMAP) file, was also supplied for error rate estimation. Analysis parameters were taken from optArguments_human.xml. We generated multiple assemblies using different minimum length cutoffs (150 kb, 180 kb and 210 kb) with two different CMAP-converted fasta assemblies (HGAP4 and Canu). After checking the resulting de novo assemblies, we decided to use the 150 kb minimum length cutoff in conjunction with the HGAP4 fasta file supplied as the CMAP file for our final de novo assembly.
2.11. Bionano Hybrid Assembly
For the step-one hybrid assembly (V1) we used the de novo Bionano assembly with the HGAP4 fasta assembly using the parameters from the aggressive human assembly setting, choosing Nb.BssSI (New England Biolabs, Ipswich, MA, USA) as the enzyme and a threshold p value of 1 × 10⁻¹⁰. In the two steps of hybrid scaffolding to align Bionano genome maps with PacBio WGS assemblies, the parameters -B2 and -N1 were used to only cut optical mapping assemblies when a conflict was found. For the step-two (V2) version of the hybrid assembly, we merged the mapped CMAP file in the hybrid V1 assembly with the unmapped CMAP file not used in the hybrid assembly with RefAligner version 5122 (Bionano Genomics, San Diego, CA, USA). Then we used the merged CMAP from V1 hybrid with the Canu assembly to carry out the V2 version of the hybrid assembly. The same parameters were used for the V2 hybrid assembly.
2.12. Assembly Error Correction
The final hybrid scaffold assemblies were error-corrected using Pilon [16]. Illumina reads were mapped to the raw Canu assembly using the BWA-MEM aligner. After mapping, the BAM file and the raw Canu assembly were supplied to Pilon to perform error correction.
2.13. Transcriptome Assembly
] was used to remove erroneous k-mers from the Illumina paired-end short reads. Adapters and low-quality reads were trimmed using Trimmomatic. The trimmed paired-end reads were assembled using the Trinity assembler (--SS_lib_type FR and --min_kmer_cov 1). Assembly statistics were calculated using QUAST. The completeness of the assembly was assessed using BUSCO against the Endopterygota database.
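Among the assembly statistics that tools such as QUAST report, N50 is one of the most commonly quoted. The sketch below shows the standard definition only: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. It is an illustration, not QUAST's implementation, and the example lengths are hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or transcript lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the sum to half the total.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical contig lengths, for illustration only.
print(n50([2, 3, 4, 5, 6]))
```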
2.14. Gene Predictions and Repeat Annotations
Complete genes were predicted from the repeat-masked genome using the Maker v2.31.8 pipeline [18], as described in the GC Specific Maker pipeline. After an initial run of Maker with est2genome [19], the resulting annotation was divided by GC content into high- and low-GC data sets. The high- and low-GC data sets, along with the original first-pass Maker annotations, were used to train the SNAP [20] and Augustus [21] HMMs for gene prediction. In the final run, the assembly was annotated with Maker using six trained models, three from SNAP and three from Augustus. High-quality gene models were selected using an Annotation Edit Distance (AED) cutoff of 0.5, according to the published Maker protocol [22].
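The AED filtering step can be sketched as follows. This minimal Python example assumes that each mRNA feature in the Maker GFF3 output carries an `_AED=` key in its attribute column, and that the cutoff is inclusive (AED ≤ 0.5); the inclusive comparison, the function names, and the example records are assumptions made for illustration, not details taken from the protocol used here.

```python
from typing import Dict, Iterable, List


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def transcripts_passing_aed(gff_lines: Iterable[str],
                            max_aed: float = 0.5) -> List[str]:
    """Return IDs of mRNA features whose _AED is at or below max_aed."""
    kept = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "mRNA":
            continue  # skip non-feature lines and other feature types
        attributes = parse_attributes(columns[8])
        if "_AED" in attributes and float(attributes["_AED"]) <= max_aed:
            kept.append(attributes.get("ID", ""))
    return kept


# Hypothetical GFF3 records, for illustration only.
example = [
    "##gff-version 3",
    "scaf1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1",
    "scaf1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1;_AED=0.12;_eAED=0.12",
    "scaf1\tmaker\tmRNA\t1000\t1900\t.\t-\t.\tID=g2-mRNA-1;Parent=g2;_AED=0.75",
]
print(transcripts_passing_aed(example))
```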
2.15. Phylogeny Analysis for Ten Insect Genomes
Genomes of Bombyx mori (GCA_000151625.), Cimex lectularius (GCA_001460545.1), Bombus terrestris (GCF_000214255.1), Bombus impatiens (GCF_000188095.1), Helicoverpa zea (GCA_002150865.1), Mamestra configurata (GCA_002192655.1), Helicoverpa armigera (GCA_002156985.1) and Cimex lectularius (GCA_000648675.1) were downloaded from NCBI. For Drosophila, the BDGP6 version of the genome was used. The nine genomes described above and the Trichoplusia ni Tni-FNL assembly were analyzed with BUSCO [23] to assess the completeness of single-copy orthologs. We used a total of 250 strict one-to-one orthologs from the 10 species to run the phylogeny analysis. One fasta sequence was generated per species by concatenating its 250 ortholog sequences. The final file containing a single sequence per species was used for multiple sequence alignment with MUSCLE [24]. RAxML [25] was used to generate the maximum likelihood phylogeny from the concatenated multiple sequence alignment using 1000 bootstrap replicates. The resulting Newick-formatted tree was plotted using iTOL [26].
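The concatenation step, in which each species contributes one sequence built from all of its one-to-one orthologs, can be sketched in Python as below. The data layout (a dictionary keyed by ortholog ID, then by species), the function names, and the toy sequences are hypothetical; the sketch only illustrates that every species must contribute exactly one sequence per ortholog, joined in the same ortholog order, before alignment.

```python
from typing import Dict, List


def concatenate_orthologs(orthologs: Dict[str, Dict[str, str]],
                          species: List[str]) -> Dict[str, str]:
    """Join ortholog sequences per species in a fixed ortholog order."""
    parts: Dict[str, List[str]] = {name: [] for name in species}
    for ortholog_id in sorted(orthologs):
        per_species = orthologs[ortholog_id]
        # Strict one-to-one: every species present, and no extra entries.
        if set(per_species) != set(species):
            raise ValueError(f"{ortholog_id} is not one-to-one across species")
        for name in species:
            parts[name].append(per_species[name])
    return {name: "".join(chunks) for name, chunks in parts.items()}


def to_fasta(records: Dict[str, str], width: int = 60) -> str:
    """Format records as FASTA text with a fixed line width."""
    lines = []
    for name, sequence in records.items():
        lines.append(f">{name}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Toy input with two orthologs and two species (hypothetical sequences).
toy = {
    "ortho2": {"T_ni": "GGH", "B_mori": "GAH"},
    "ortho1": {"T_ni": "MKV", "B_mori": "MRV"},
}
merged = concatenate_orthologs(toy, ["T_ni", "B_mori"])
print(to_fasta(merged), end="")
```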
By combining PacBio single-molecule long-read sequencing with Bionano optical genome mapping and 10X Genomics long linked-reads technologies, we were able to produce a high-quality genome assembly of the Tni-FNL cell line genome. Since lepidopteran chromosomes are prone to fragmentation and have complex karyotypes [37], assembly of a lepidopteran host genome presented a major challenge. With an average PacBio sequencing read length greater than 10 kb, the reads could span most repetitive elements and be placed unambiguously, which enabled us to build a highly contiguous assembly. Two sets of WGS assemblies, from the HGAP4 and Canu assemblers, were generated in this process. We further improved the WGS assemblies by integrating Bionano genome maps to build hybrid scaffolds. The genome maps helped to identify chimeric contigs and to fix mis-assemblies and redundancies present in the WGS contigs. We improved the hybrid scaffold results by using the long linked-reads generated on the 10X Genomics Chromium platform. Barcoded long linked-reads, which originate from larger, single molecules of DNA, were used in place of costly BAC clone or pooled fosmid clone libraries. The 10X linked-read assemblies can effectively measure the contiguity and completeness of the hybrid scaffolds produced from the PacBio and Bionano data and help to identify connection errors in the PacBio-Bionano hybrid assembly. The resulting assembly comprises 359 million base pairs, of which less than 5.5% are gaps. It represents one of the most contiguous draft assemblies of a lepidopteran host genome to date.
To further assess the genome assembly quality and elucidate gene functions in this lepidopteran host, we performed gene prediction and comparative analysis with other insect genomes, and used the transcriptome sequencing data for this cell line together with data generated in previous studies [6]. Comparative analyses of orthologs from the Tni-FNL (Trichoplusia ni) genome and other lepidopteran hosts confirmed previous findings that insect genomes share a common set of genes that maintain their integrity throughout their evolution [44]. The total number of repeat elements identified in the Tni-FNL genome is very similar to those reported for other lepidopteran insect genomes. The results offer insights for further studies on how changes in the extent of repeat regions are involved in maintaining genome integrity among insect genomes.
The predicted genes of Tni-FNL identified in this study also provide additional resources for studying genetic variation and genome evolution in insect genomes. As previous studies have suggested, full-genome sequences from multiple species can complement each other by clarifying gene function and organization. In addition, this work will enable efforts to develop systems biology tools to improve the utility of the Tni-FNL cell line for protein production. Recent reports have demonstrated for the first time the ability to genetically engineer Trichoplusia cell lines using the CRISPR/Cas9 system [49]. High-quality genome modification requires detailed genomic information to ensure high efficiency of targeted modifications and a reduction in unwanted off-target effects. Using the high-quality genome sequence and newly developed engineering tools, we believe it will be possible to begin making modifications to Tni-FNL that will ultimately improve the quality and lower the cost of therapeutic protein production using this system.