Agricultural pest arthropods damage crops and endanger animal and human health both directly through disease and indirectly by threatening global food supply. Specifically, herbivorous and parasitic insects impact plant and animal health, respectively, through direct feeding or by vectoring disease-causing viruses and other pathogens. In the case of zoonotic diseases, effects on animal food production compound the direct effects on human health. For example, ticks and tick-borne pathogens pose a major threat to US public health and livestock production, with the economic damage for Lyme disease alone estimated at up to USD 4.8–9.6 billion per year [1]. Herbivorous insects can dramatically reduce the quantity and quality of products both pre- and post-harvest. An estimated 6% of maize production is lost to insect pests in the United States annually [2], which amounts to over USD 3 billion per year using the latest production data [3] and a corn market price of USD 3.75 per bushel. The western corn rootworm (Diabrotica virgifera virgifera) alone was responsible for USD 1.4 billion in direct production losses in 2010 [4].
One grand challenge facing agriculture is the need to increase production by up to 70% to meet the demands of a human population anticipated to reach 10 billion by 2050 [5] while simultaneously reducing environmental impacts and meeting the challenges posed by climate change. The threats to agriculture by insects are pernicious and ever-increasing, and pest control presents major hurdles for achieving 2050 production needs [6]. Insects are not a new threat to agriculture, but their impacts on production have been greatly affected by pesticide use, climate change, and the introduction of non-native insects into new habitats and landscapes through the shipping of infested materials and agricultural products around the globe. Widespread insecticide resistance among arthropod pest species has emerged [7], expanded seasonal activity and geographic ranges of native pests have increased damage [9], and the migration of non-native pests between habitats has challenged ecosystems [10]. Our ability to control arthropod pests must undoubtedly also evolve and adapt to mitigate these threats, and genomics, in particular, holds promise to facilitate the development of innovative and resilient control technologies.
Genome assemblies provide comprehensive information about the genome that cannot be matched by transcriptome sequencing and assembly. Full genome assemblies are not restricted to a subset of expressed regions that can easily miss gene duplications, regulatory components, and genes with low expression levels. Non-transcribed regions of the genome can influence gene expression in various ways [11]. For example, promoters, enhancers, and other non-coding DNA segments affect gene regulation more often than protein-coding regions do, and such regulatory effects can have strong impacts on phenotype [16]. In addition, non-translated RNAs, such as microRNAs or long non-coding RNAs (lncRNAs) that are not identified in typical transcriptome sequencing, can play key roles in establishing phenotypes, and characterizing them can improve our understanding of how insects interact with their plant hosts and adapt to changing environmental conditions [20]. Recent estimates suggest that nearly 90% of economically or ecologically important traits in organisms may be determined by variation in non-coding regions of the genome [22], indicating the need for high-quality reference genome assemblies to study traits relevant to pest management.
Large-scale genome sequencing initiatives such as i5k, the initiative committed to sequencing 5000 arthropod genomes [23], are developing the infrastructure to build reference-quality genome assemblies to facilitate basic and applied research that will lead to improved pest management tactics. A pilot project of the i5k produced genome assemblies of 28 species and greatly improved our understanding of the challenges of sequencing arthropods [25]. More recently, the Earth BioGenome Project (EBP) has brought together numerous affiliated consortia to produce reference-quality genome assemblies from species across the tree of life, with the ultimate goal of sequencing all eukaryotes over a 10-year period [26]. The Ag100Pest Initiative [27] is a bold endeavor by the United States Department of Agriculture, Agricultural Research Service (USDA-ARS) to generate reference-quality genome assemblies for the top 100 US agricultural pest arthropod species, thus advancing the missions of both the i5k Initiative and the EBP [26].
The USDA-ARS performs research to support the health of beneficial arthropods and control the damaging effects of pests in order to enhance food security and human health [28]. This article describes the framework for the Ag100Pest Initiative, encompassing the scope, operation, and challenges and lessons learned since inception. The Ag100Pest Initiative is developing low-cost, high-quality reference genomes from single insect specimens, including insects of large and small physical and genome size. Organizing a coordinated initiative to address these goals is not a trivial undertaking; it requires adequate infrastructure, streamlined and effective methodologies for library production, sequencing and bioinformatic analysis, operational and administrative schemata, and, of course, funding. Technological aspects will undoubtedly change as sequencing and assembly methods evolve, but the Ag100Pest Initiative framework and operational advances can inform those currently involved in or planning analogous endeavors. Ag100Pest has developed a pipeline using a combination of long-read sequencing from a single specimen and HiC scaffolding, along with companion RNA expression data, to generate annotated genome assemblies that meet or exceed EBP standards (Figure 1). This effort is greatly changing the landscape of insect genomics research, and we hope that by sharing our insights, others will join in this revolution.
The Ag100Pest Initiative has prioritized the sequencing and assembly of genomes from 158 species from 54 families across 8 arthropod orders. This includes 18 families and 121 species that lack a publicly available assembly of any quality (Figure 2; species list at [27]). The total number of assemblies in progress will be higher than the number of species as we are sequencing multiple isolates, biotypes, subspecies, or sexes for some species. Selection of species for the Ag100Pest Initiative was made on the basis of their status as important beneficial or pest species, as opposed to maximizing taxonomic breadth. Nevertheless, we will make a substantial contribution to the EBP goal of generating a reference assembly for a representative of every eukaryotic family and an assembly for every species [1]. Toward this end, our focus on high-quality assemblies (defined, in part, by the Vertebrate Genomes Project (VGP) [65] as contiguity measures of contig N50 > 1 Mbp and scaffold N50 > 10 Mbp) will elevate the overall contiguity and accuracy of arthropod genomes in the public domain and provide a family level representative for 45 families across 3 orders that currently lack a high-quality assembly for any species (Figure 2). A notable impact in the order Coleoptera is expected with our goal of contributing 50 assemblies, nearly doubling the current number of 54 lower-quality coleopteran public assemblies (Table 1). The contig assemblies already generated for almost half of the intended Ag100Pest coleopteran genome assemblies (22 species; Figure 3) surpass the contig contiguity of the majority of publicly released assemblies for this order. Other similarly substantial impacts will be made for orders Hemiptera, Hymenoptera, Ixodida, and Orthoptera (Table 1).
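The contiguity definition quoted above (contig N50 > 1 Mbp and scaffold N50 > 10 Mbp) can be made concrete with a short calculation. The following is a minimal Python sketch, not part of the Ag100Pest pipeline: the function names `n50` and `meets_contiguity_thresholds` are hypothetical, and the sequence lengths are invented toy values rather than data from any assembly.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths in bp.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def meets_contiguity_thresholds(contig_lengths, scaffold_lengths,
                                min_contig_n50=1_000_000,
                                min_scaffold_n50=10_000_000):
    """Check the two contiguity cut-offs quoted in the text (values in bp)."""
    return (n50(contig_lengths) > min_contig_n50
            and n50(scaffold_lengths) > min_scaffold_n50)


# Toy lengths in bp, invented for illustration only (an assumption).
toy_contigs = [4_000_000, 3_000_000, 2_000_000, 500_000]
toy_scaffolds = [30_000_000, 20_000_000, 5_000_000]

print("contig N50:", n50(toy_contigs))
print("scaffold N50:", n50(toy_scaffolds))
print("meets thresholds:", meets_contiguity_thresholds(toy_contigs, toy_scaffolds))
```

The check covers only the contiguity part of the definition; the text notes that these measures define a high-quality assembly only in part, so other criteria are outside this sketch.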
Ag100Pest began by using continuous long reads (CLRs) for assembly (details not presented herein) as the improved HiFi procedure [33] had not yet been developed. In collaboration with Pacific Biosciences, we developed methods for low DNA input library preparation and HiFi sequence generation that were key to the success of the Initiative. The choice of library preparation method is highly dependent on individual samples and beyond the scope of this project overview. However, key aspects for consideration are organism size (i.e., the amount of DNA available for an individual sample), difficulty of extraction (i.e., the quality and size distribution of DNA fragments), and genome size. The methods available range from ultra-low input methods, suitable when the genome size is less than 1 Gbp and the specimen size is very small, to standard library preparation methods, suitable when the individuals are relatively large and the genome is also large and requires multiple sequencing runs to achieve the desired coverage. For most insects, we find the low-input protocol [66] is the best compromise among the three available library preparation methods because it performs well for relatively small insects over a range of genome sizes.
The majority of selected Ag100Pest species do not have existing public assemblies; however, 37 species with relatively low-quality assemblies were included to improve their assembly quality (Figure 2). We have generated contig-level assemblies for 11 of these 37 (Table 2), and for 10 of them we improved contig N50 by several orders of magnitude. The exception, Haemaphysalis longicornis, illustrates the difficulties inherent in a project attempting to assemble a broad diversity of Arthropoda genomes. Our initial contig N50 showed only a modest improvement over the previous assembly. Likely because H. longicornis present in the United States appears to be parthenogenetic and is, therefore, either triploid or aneuploid [67], our assembly size is substantially larger than the predicted genome size. This suggests the presence of haplotypic duplication that complicates the generation of a single haplotype representation of a polyploid genome [35]. We anticipate that the contig N50 of our assembly will improve after the haplotypic duplication is removed [68] because the alternate haplotype contigs tend to be smaller and, therefore, artifactually reduce the N50 value. Nevertheless, this species illustrates one example of the challenges inherent in developing a “one-size-fits-all” pipeline applied to the huge diversity of arthropod species.
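The expected effect of removing haplotypic duplication on contig N50 can be illustrated numerically. The sketch below is a toy Python example under stated assumptions: the contig lengths are invented, the split into primary contigs and smaller alternate-haplotype contigs is given in advance, and the helper `n50` is a hypothetical name. In a real assembly the duplicated contigs must first be identified, which this sketch does not attempt.

```python
def n50(lengths):
    """Return the N50 (bp) of a list of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Invented toy values (assumptions), not taken from the H. longicornis assembly.
primary_contigs = [5_000_000, 4_000_000, 3_000_000]
alternate_haplotype_contigs = [400_000] * 20  # many small duplicated contigs

with_duplication = primary_contigs + alternate_haplotype_contigs

size_before = sum(with_duplication)
size_after = sum(primary_contigs)
n50_before = n50(with_duplication)
n50_after = n50(primary_contigs)

print(f"assembly size: {size_before} bp -> {size_after} bp")
print(f"contig N50:    {n50_before} bp -> {n50_after} bp")
```

In this toy case the small duplicated contigs inflate the assembly size and pull the N50 down, so removing them shrinks the assembly and raises the N50, which is the direction of change anticipated in the text.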
For the 47 species distributed across seven orders for which we have completed HiFi long-read sequencing and contig assemblies, our assembly lengths range from 144 Mbp to 8.7 Gbp, with contig N50s ranging from 0.88 to 70 Mbp (Figure 3, Table S2). Final contig N50 and assembly sizes for these assemblies may change during the scaffolding and contamination removal steps. After the completion of these processes, the assemblies will be deposited into NCBI. The Ag100Pest Initiative is committed to the free and open access of all data in the public domain while still maintaining defined ownership of input specimens and assembly outputs through academic research agreements to protect the interests of all parties involved.
The Ag100Pest Initiative was launched in October 2018, at which time only 6 of 366 (1.6%) arthropod genomes then available through NCBI met our standards of contiguity (taken from those [65] of the Vertebrate Genomes Project (VGP) for defining high-quality assemblies). Therefore, while producing genome assemblies that met the VGP standard was possible at the time for a handful of species, it was not straightforward for the majority of arthropods due to technological and biological issues. Ag100Pest’s goal to produce reference-quality assemblies was, therefore, all the more audacious in 2018 because we intended to sequence at scale, with long-read sequencing coming from a single specimen, not pools, for a wide variety of species across several taxa. The success of our project has not only allowed it to expand beyond the initially intended 100 species but has also provided a framework by which other initiatives can contribute to the lofty goal of the EBP to sequence all known eukaryotic species.
The inability to produce long-read data from single specimens was a technological challenge that hindered assemblies in the past, fracturing assemblies and inflating the number of haplotigs that originated from the same genomic interval. Advances in genomic DNA isolation, long-read library construction, and sequencing [69] have been fundamental to the success of the Ag100Pest Initiative, helping to ensure the assemblies produced by Ag100Pest will meet or exceed quality metrics established by the EBP [26] and VGP [65]. Our continuous integration and refinement of new methods to address particular challenges posed by arthropods have allowed Ag100Pest to sequence species that were not tractable when we began this project. Specifically, input DNA requirements have fallen since the project’s inception with the advent of low and ultra-low input protocols for long-read sequencing libraries [66], which have allowed us to sequence species with very small physical sizes. Additionally, PacBio’s optimization of circular consensus sequencing (CCS) greatly increased sequencing accuracy through the generation of High-Fidelity (HiFi) reads [33], which hold many benefits over CLRs. With these decreases in input requirements and increases in output accuracy, sequencing data can be generated from a single specimen rather than pools of specimens. Assembly phasing is, therefore, improved and the introduction of additional heterozygosity into the assembly graph is reduced, resulting in a more complete and contiguous assembly. Long-read sequencing technology now enables high-quality arthropod genome sequencing and assembly across the broad diversity of arthropods.
Unfortunately, some species still present unique challenges to DNA extraction, sequencing efficiency, and assembly contiguity, and, often, these cannot be anticipated in advance. We have found that sequencing output varies across species and cannot always be attributed to sample quality. In general, we found that sequencing success was most improved when HiFi libraries were immediately prepared from recently extracted DNA that had not been frozen, stored for long periods of time, or shipped. Therefore, we do not recommend shipping extracted high molecular weight (HMW) DNA to a sequencing facility for library preparation and sequencing. Instead, we recommend either sending the specimen itself to the facility for DNA extraction and library preparation or preparing libraries before shipping. Additionally, while highly accurate CCS long-read sequencing that produces HiFi reads is currently the best approach to resolving repetitive genome architecture, regions with large arrays of highly similar repeats, longer than the sequencing reads themselves, may remain difficult to assemble without the incorporation of ultra-long reads. These remaining challenges are small in comparison to the state of the field just two years ago, when only a small fraction of assemblies met high-quality standards (Figure 3).
Only 101 of 787 (12.8%) arthropod species currently have a genome assembly in the public domain that meets the definition of high-quality (Figure 2). With the advancements noted above, highly accurate, low-cost sequencing technology and genome assembly methods are no longer the limiting factors for producing high-quality genome assemblies in the vast majority of arthropods despite the wide range of physical and genome size challenges they present. By adopting the latest sequencing and assembly methods and paying particular attention to details such as proper specimen preservation, reference genome assemblies can be produced by all sequencing consortia. We encourage other sequencing consortia to commit to the production of high-quality genome assemblies in order to advance both the phylogenetic breadth of sequenced species and their overall contiguity and completeness.
The high-quality genome assemblies Ag100Pest is producing for pest arthropods are fundamental infrastructure for basic and applied research. One benefit of having the USDA-ARS undertake this project is that Ag100Pest can leverage personnel and infrastructure resources by making investments in permanently funded staff, sequencing platforms, and computational support that are not limited by typical granting cycles. USDA-ARS scientists also possess unique expertise in arthropod pest management and agricultural genomics research across a wide breadth of commodities and cropping systems. Sequencing of arthropods advances our understanding of the physiology, ecology, and evolution of pests and beneficial arthropods. Translational research products based on that knowledge will lead to improvements in the agricultural economy that will come to agricultural producers through technological advances in the efficacy and durability of environmentally sustainable pest management practices. For example, high-quality genome assemblies are used in the development of novel molecular-based management tools that target pests while sparing environmental damage, particularly damage to beneficial arthropod populations. As such, the accumulation of genome assemblies for arthropods contributes to a foundation of support for the bioeconomy. Increasing profitability while reducing any negative environmental impacts of agricultural production directly benefits rural economies, societal well-being, and overall human health. Maintaining the quantity, quality, and stability of production is critical to global food security that is required to provide nutritious food to a growing human population as well as raw materials for industrial production of bio-based products. 
The Ag100Pest Initiative addresses this multitude of stakeholder needs through the development of high-quality foundational genomic information that is anticipated to facilitate the development of novel tools and products for the targeted management of pests and the preservation of beneficial insect health. While these and other outcomes, as well as changing stakeholder needs, will continue to reprioritize objectives within the Ag100Pest Initiative, we remain committed to supporting the scientific community and agricultural and societal interests.