Filtering the Junk: Assigning Function to the Mosquito Non-Coding Genome

Simple Summary In eukaryotes, the fraction of the genome not coding for proteins vastly outsizes the portion containing protein-coding genes. This non-coding genome, once termed “junk”, was thought for decades to be inconsequential to the biology of an organism. It is now widely acknowledged that elements within the non-coding genome serve important gene-regulatory functions impacting when, where, and to what levels genes and their protein products are expressed. Without an amino acid-like code to decipher non-coding regulatory elements within the genome, significant technology development has aided in their discovery. Currently, genome-wide identification of non-coding regulatory elements is an active area of research with significant progress made in humans, mice, and other model organisms. However, work to address the roles of these elements in mosquito disease vectors is in its infancy. In this article, we review existing methodology to generate genome-wide catalogs for three classes of non-coding elements and discuss their use in mosquito disease vectors and other insects. Abstract The portion of the mosquito genome that does not code for proteins contains regulatory elements that likely underlie variation for important phenotypes including resistance and susceptibility to infection with arboviruses and Apicomplexan parasites. Filtering the non-coding genome to uncover these functional elements is an expanding area of research, though identification of non-coding regulatory elements is challenging due to the lack of an amino acid-like code for the non-coding genome and a lack of sequence conservation across species. This review focuses on three types of non-coding regulatory elements: (1) microRNAs (miRNAs), (2) long non-coding RNAs (lncRNAs), and (3) enhancers, and summarizes current advances in technical and analytical approaches for measurement of each of these elements on a genome-wide scale. The review also summarizes and highlights novel findings following application of these techniques in mosquito-borne disease research. Looking beyond the protein-coding genome is essential for understanding the complexities that underlie differential gene expression in response to arboviral or parasite infection in mosquito disease vectors. A comprehensive understanding of the regulation of gene and protein expression will inform transgenic and other vector control methods rooted in naturally segregating genetic variation.


Introduction
Phenotypic diversity in mosquitoes cannot be explained using only variability among protein-coding regions of the genome. Rather, phenotypic variation may be the result of differences in gene and protein expression driven by changes in three-dimensional chromatin structure and regulatory elements residing within the non-coding, "junk", regions of the genome. In 1972, geneticist Susumu Ohno coined the term "junk DNA" to describe all non-coding portions of the genome. These "junk" regions, comprising up to 80% of the genome, are scattered randomly throughout the genome and often arise

MicroRNAs
MicroRNAs (miRNAs) are non-coding RNA molecules 18-24 nucleotides in length that regulate gene expression post-transcriptionally [9]. miRNAs are present in a wide variety of organisms, including plants, vertebrates, insects, and some viruses (for reviews see [10][11][12][13]). At any one time, an organism expresses hundreds of miRNAs that bind by sequence complementarity to target messenger RNA (mRNA) [14,15]. miRNAs mediate mRNA repression by binding to an Argonaute protein and forming a miRNA-induced silencing complex (RISC) where they guide the complex to target mRNAs (see reviews for more details [16][17][18]). Genetic variation within miRNAs may affect their binding efficiency to target genes and in turn modulate gene expression. The effects of SNPs in miR-NAs are just beginning to be investigated, but a handful of disease and phenotypic associations have been detected [19,20]. In mosquitoes, miRNAs have been cataloged, with some functional studies, but the effect of miRNA genetic variation has not yet been examined [12,[21][22][23][24].
Techniques for cataloging miRNAs on a genome-wide scale include microarraybased approaches and small RNA sequencing (sRNA-seq). miRNA microarray profiling is a hybridization probe-based system where miRNAs bind to fluorescent probes containing complementary sequence. This experimental approach cannot measure recently annotated or novel miRNAs, and the low signal-to-noise ratio limits the feasibility to detect lowly abundant miRNA [25]. Due to these technological limitations, the majority of recent studies utilize a next-generation sequencing sRNA-seq approach. sRNA-seq allows for the prediction of novel miRNA as sequence data is mapped directly to the genome [26] and does not rely on existing miRNA catalogs. sRNA-seq also allows for the detection of lowabundance miRNA transcripts and yields data on novel miRNA nucleotide sequence [27]. sRNA-seq differs from standard mRNA sequencing by the addition of a size selection step during the library preparation where the small RNAs are isolated from larger RNA molecules through a gel electrophoresis step [28,29]. There are inherent biases of the sRNA-

MicroRNAs
MicroRNAs (miRNAs) are non-coding RNA molecules 18-24 nucleotides in length that regulate gene expression post-transcriptionally [9]. miRNAs are present in a wide variety of organisms, including plants, vertebrates, insects, and some viruses (for reviews see [10][11][12][13]). At any one time, an organism expresses hundreds of miRNAs that bind by sequence complementarity to target messenger RNA (mRNA) [14,15]. miRNAs mediate mRNA repression by binding to an Argonaute protein and forming a miRNA-induced silencing complex (RISC) where they guide the complex to target mRNAs (see reviews for more details [16][17][18]). Genetic variation within miRNAs may affect their binding efficiency to target genes and in turn modulate gene expression. The effects of SNPs in miRNAs are just beginning to be investigated, but a handful of disease and phenotypic associations have been detected [19,20]. In mosquitoes, miRNAs have been cataloged, with some functional studies, but the effect of miRNA genetic variation has not yet been examined [12,[21][22][23][24].
Techniques for cataloging miRNAs on a genome-wide scale include microarray-based approaches and small RNA sequencing (sRNA-seq). miRNA microarray profiling is a hybridization probe-based system where miRNAs bind to fluorescent probes containing complementary sequence. This experimental approach cannot measure recently annotated or novel miRNAs, and the low signal-to-noise ratio limits the feasibility to detect lowly abundant miRNA [25]. Due to these technological limitations, the majority of recent studies utilize a next-generation sequencing sRNA-seq approach. sRNA-seq allows for the prediction of novel miRNA as sequence data is mapped directly to the genome [26] and does not rely on existing miRNA catalogs. sRNA-seq also allows for the detection of lowabundance miRNA transcripts and yields data on novel miRNA nucleotide sequence [27]. sRNA-seq differs from standard mRNA sequencing by the addition of a size selection step during the library preparation where the small RNAs are isolated from larger RNA molecules through a gel electrophoresis step [28,29]. There are inherent biases of the sRNAseq technique including the effects that GC content and adaptor/barcode sequences can have on the efficiency of cDNA synthesis prior to sequencing (see recent reviews for more details [30,31]). Sequence reads from sRNA-seq are analyzed using bioinformatic tools to predict novel miRNAs and their functions, miRNA structure, phenotypic association and regulatory targets (see recent reviews [27,32]).
The functions of miRNAs in mosquito species are diverse, including regulation of immune response to pathogens and transcriptional regulation during specific life stages or in different tissues. As many mosquito species have the ability to vector human pathogens such as the arboviruses; dengue, Zika, and chikungunya, as well as Plasmodium parasites, understanding the role of miRNAs during the immune responses to these pathogens can be vital in developing novel vector control strategies (see recent reviews [22,33]). A recent study examined the miRNA expression in midguts of Anopheles anthropophagus fed either on non-infected or Plasmodium-infected blood [29]. In the non-infected blood experiment, nine significantly upregulated and 10 significantly downregulated miRNAs were identified, with one (miR-92a) previously reported as induced upon blood feeding in Aedes aegypti [29,34]. Feeding on Plasmodium-infected blood elicited up-and downregulation of an additional 13 and 11 miRNAs, some of which have been identified upon Plasmodium or Wolbachia infection in other mosquito species [29,35,36]. Recent studies have highlighted the complex involvement of miRNAs in the viral response in multiple mosquito vector species with some exhibiting potential proviral effects [37,38] and others potential antiviral effects [21,[38][39][40]. A purely bioinformatic approach has been used to identify potential binding sites of Ae. aegypti miRNA in the chikungunya, dengue, and Zika viral genomes (no experimental validation was attempted [41]). A study of Ae. aegypti miRNA responses to Ross River virus (RRV) infection examined the antiviral response in the fat body and midgut tissues post-inoculation [42] and identified 14 differentially-regulated miRNAs with the majority of differentially expression in fat body at 2 days post inoculation. Prediction of mRNA targets for these miRNAs implicated several genes related to immune response; however, further work is needed to characterize the role of these miRNAs in viral replication [42].
Numerous recent studies have dissected the role of individual miRNAs in mosquito vector competence, mosquito physiology, and insecticide resistance. Work in Anopheles coluzzii showed that blood meal-induced miRNA-276 is integral to the regulation of the mosquito reproductive cycle with silencing of miRNA-276 resulting in increased female fertility and decreased Plasmodium transmission [43]. In Anopheles gambiae, coordinated changes in miRNA expression levels in energy-storing tissues appear to play a role in blood meal-induced metabolic changes observed following feeding [44]. Recent work in Culex pipiens implicated miRNAs in differential susceptibility to deltamethrin insecticides and adult reproductive diapause through their impact on ovarian development and lipid abundance [45,46]. A recent review of miRNA expression and function in mosquitoes summarizes further the roles of individual miRNAs in mosquito biology [12]. Taken together, all of these studies emphasize the important regulatory roles miRNAs play in all aspects of mosquito physiology, vector competence, and insecticide resistance. Further, given their importance in vector competence and insecticide resistance, miRNAs are likely to influence vector control methods currently centered on the use of insecticides.
One remaining challenge in the study of miRNAs is characterizing their functional interaction(s) with the mRNAs they regulate. Recent application of the covalent ligation of endogenous Argonaute-bound RNAs (CLEAR)-crosslinking and immunoprecipitation (CLIP) technique [46] in An. gambiae has begun to explore physical interactions between miRNAs and their target mRNAs [47,48]. The technique results in the simultaneous capture of thousands of miRNA-mRNA target pairs after direct ligation of the miRNA and its cognate target transcript in endogenous Argonaute-miRNA-mRNA complexes. This recent work not only confirmed known interactions between miR-309 and homeobox gene SIX4, but also highlighted many additional interactions for this single miRNA [49]. CLEAR-CLIP assays identified a total of 220 miR-309-mRNA interactions involving 204 distinct mRNA transcripts. CLEAR-CLIP-like approaches are necessary to assign mechanistic function to miRNAs and to specifically identify the mosquito mRNAs they regulate to modulate the whole mosquito phenotype. Knowledge of these interactions will shed light on how genetic variation in either miRNAs or their target mRNAs impacts gene expression.

Long Non-Coding RNAs
One of the lesser-studied non-coding elements are long non-coding RNAs (lncRNAs), defined as transcripts longer than 200 nucleotides that lack amino acid coding potential. lncRNAs have many mRNA-like characteristics, including that they are transcribed by RNA polymerase II, 5' capped, polyadenylated, and often spliced [50]. lncRNAs can be sense overlapping, sense intronic, antisense, or they can be intergenic (intergenic lncRNAs are often called lincRNAs for long intergenic non-coding RNA). Based on data showing abundant expression in only certain cells or tissues, lncRNAs are thought to be more tightly regulated than mRNAs [51,52]. lncRNAs have been linked to various biological functions including both cis and trans regulation of gene expression, development, dosage compensation, and imprinting (see recent reviews [53][54][55]). lncRNAs have also been shown to interact with miRNAs, thereby reversing the effects of miRNAs on mRNA expression. This miRNA sponge role has lncRNAs poised to serve as a tool in controlling miRNA function, potentially in a therapeutic setting [56]. Further, genetic variation in lncRNAs may impact lncRNA expression levels, splicing, and/or the stability of any lncRNA-mRNA interactions [57,58]. While work in model organisms has progressed steadily, the repertoire of lncRNAs in non-model organisms have only recently begun to be explored.
lncRNAs can be cataloged from standard RNA-Seq high-throughput sequencing approaches (detailed methodological and bioinformatic approaches for insect lncRNA discovery reviewed in [59]). Given that lncRNAs tend to be rare compared to mRNA [60], the depth of sequencing necessary to reliably detect lncRNAs should be considered when planning an experiment. Two published studies aimed at cataloging lncRNAs in Anophelines used 223 and 500 million sequence reads [61,62]. As many lncRNAs are expressed antisense to protein-coding genes which they often regulate, it is also recommended to employ stranded RNA-seq approaches. To catalog lncRNAs in Anophelines, the following data analysis pipeline was employed, TopHat [63] was used for read mapping, Cufflinks [64] for annotation, and CuffCompare for comparison with existing genome annotations. Only transcripts with class codes, "i", "u", and "x" denoting intronic, intergenic, and antisense, respectively, were selected as possible lncRNAs. Following mapping and annotation, coding potential was analyzed using one of the available tools, including the Coding Potential Assessment Tool (CPAT) [65], the Coding Potential Calculator (CPC) [66], or PhyloCSF, with CSF standing for Codon Substitution Frequencies [67]. In an Aedes albopictus study, novel lncRNA loci were identified using FEELnc, a platform that predicts lncRNA using a random forest model trained on multi k-mer frequencies and relaxed open reading frames [68]. Differentially-regulated lncRNAs identified from RNA-seq data can be validated using standard qRT-PCR approaches.
Although some insect lncRNAs have been identified with functional roles over the last decade, the majority of the progress in determining the function of lncRNA has been made in vertebrates [69]. In recent years, studies examining lncRNAs in insects have increased, with much of this work focused on insect development, insecticide resistance, and antiviral defense in insect pests [70]. Through a computational pipeline, thousands of lncRNAs have been identified from RNA-seq data of the diamondback moth [71]. Other recent work in Drosophila has highlighted the important role of lncRNAs in development and immunity [72,73].
lncRNAs are known to play a role in sex determination in various organisms, including mammals, fish, crustaceans, and insects [74][75][76][77]. In Drosophila, lncRNAs have been implicated in the activation of expression of the sex determination gene Sex-lethal (Sxl) necessary to determine female sex [78]. In Aedes aegypti, sex determination is regulated by the male determining locus, M, located in a Y chromosome-like region on chromosome 1. Through recent sequencing efforts, the A. aegypti genome assembly and the annotation of this highly-repetitive M locus have drastically improved [79,80]. The improved genome annotation in the M/m sex determination locus highlights a number of putative lncRNA genes. Work to functionally characterize the role of these predicted lncRNAs and their role in sex determination is ongoing. Given efforts to use release of sterile male mosquitoes for vector control, understanding the molecular mechanisms underlying sex determination could be advantageous for efficient enrichment of male mosquitoes.
There are a small number of studies that have both cataloged lncRNAs in mosquitoes and begun to explore their function, particularly as it relates to viral transmission within the Aedes genus. Two studies have implicated lncRNAs in host-arboviral interaction. RNAimediated knockdown of one lncRNA candidate in Ae. aegypti resulted in higher Dengue virus replication [81], and differentially-expressed lncRNAs have been associated with Zika virus infection in Ae. aegypti [82]. A recent cataloging of lncRNAs in Ae. aegypti reported that they shared many of their characteristics with lncRNAs from other species, including low levels of expression, low GC content, short length, and less conservation than protein-coding mRNAs [83]. This catalog also highlights that Ae. aegypti lncRNAs contain a greater fraction of repeat elements as compared to protein-coding mRNAs, and that lncRNAs display highly temporal expression patterns [83]. Recently the same research team did a similar study in the Southern house mosquito (Culex quinquefasciatus) and showed that lncRNAs may play a role in blood meal acquisition in adult females [84]. Work in the Anopheles genus has similar findings to work in Ae. aegypti, including lower sequence conservation in lncRNAs as compared to protein-coding genes, however there is notable conservation in lncRNA secondary structure within the Gambiae complex containing the major malaria vectors in Sub-Saharan Africa, and more divergent secondary structure in the rest of the Anopheles genus [61]. A recent study in Ae. albopictus identified 2632 novel lncRNAs with a small fraction of these showing male-and female-specific expression patterns [85]. Work on lncRNAs in mosquitoes remains relatively novel and as a result, nothing is known about the functional consequence of genetic variation in lncRNAs.

Enhancers
Enhancers are short cis-acting regulatory elements that increase transcriptional levels of target genes by hundreds of fold over the basal level of the core promoter elements [86]. Enhancers control transcriptional activity of a gene, or suite of genes, and are responsible for almost all regulated gene expression in the transcriptome [87,88]. Enhancers can be located near their target gene(s) or megabases distant from the target genes they regulate [89]. Nevertheless, the identities of enhancers and the interacting protein factors that lead to their regulatory function are little known, even in well-studied model genomes [87,88]. An important reason for this is that enhancers cannot yet reliably be predicted by sequencebased algorithms, and until recently, available screening methods were manual and thus limited in scale. Sequence polymorphism of enhancer sequences can cause phenotypic differences, including predisposition to disease, as observed in diverse organisms [2,[90][91][92][93]. At least 70-90% of significantly-associated human GWAS SNPs are estimated to lie within functional enhancers [2,4,94]. At the population level, positively-selected variation at enhancers and other non-coding regulatory elements between species or subgroups likely play an important role in differentiation and evolution [95,96], for example, some of the most diverged sequence of the human genome, as compared to great apes, have been classified as functional enhancers [97]. Very little is known about enhancers in mosquito disease vectors, and nothing is known about non-coding variation and vector phenotypes. A recent review provides a comprehensive summary on studying enhancers in non-model insects [98]. For an in-depth review on chromatin structure and function in mosquitoes, including 3D explorations of the genome using the Hi-C high-throughput sequencing approaches to identify topologically associated domains (TADs), see this recent review [99].
Here, the focus is on direct and indirect experimental approaches to catalog mosquito transcriptional enhancers.

Screening for Enhancers
Despite their known role in gene expression regulation [87,88], until recently there has not been a method for high-throughput, direct, and quantitative screening of DNA sequences for enhancer activity. Indirect screening methods such as ChIP-seq and DNaseseq can infer the presence of enhancers by detecting the open chromatin state correlated with binding of trans-acting factors and histone modification, but do not directly measure enhancer activity [100].
In contrast, functional assays detect enhancers by measuring enhancer activity from a target gene with a measurable readout. The gold standard assay is manual cloning of a candidate enhancer fragment into an expression vector, where the putative enhancer activates a minimal core promoter, driving expression of a luciferase reporter, whose light readout is the measure of enhancer-dependent expression [101]. Enhancers carry the information necessary for their autonomous function, which is preserved even when placed into a heterologous surrounding sequence context such as a reporter plasmid. Self-transcribing active regulatory region sequencing (STARR-seq) assay is a massively parallel reporter assay that detects enhancers directly by their functional properties, querying millions of DNA fragments simultaneously [95,102]. STARR-seq is, in essence, a simultaneous genome-wide luciferase assay, with the exception that it measures enhancer-dependent transcript levels as sequence reads from RNA-seq data, rather than light output due to translated protein.

Direct Methods for Enhancer Discovery
When the goal of an experiment is to discover enhancers or other cis-regulatory elements (CREs), direct methods of regulatory element discovery are often very useful. Such methods find their origins in luciferase assays, where a single DNA sequence is cloned into a vector containing a luciferase reporter construct. This approach is useful for testing one gene at a time and is still considered the gold standard for determining the enhancer activity of a gene. Work in Anopheles stephensi has used a transposon-mediated enhancer detection approach using the Gal4-UAS system, but this approach is labor-intensive and does not explore enhancers on a genome-wide scale [103]. There is growing need in the field to identify regulatory elements and their interactions across the genome. It is nearly impossible to screen enhancer activity on a whole genome scale using single gene luciferase assays, and so efforts to scale up the throughput of luciferase assays brought massively parallel reporter assays (MPRAs) [104,105], which allow for the simultaneous assessment of activity for thousands of enhancers. While an important development in the field, MPRAs have three major drawbacks. First, the MPRA approach uses oligonucleotide arrays to synthesize tested sequences with the maximum length of synthesis limited to 200 bp, rendering the study of enhancers larger than 200 bp infeasible. Second, the insertion of reporter genes into the genome on a large scale often causes substantial positional effects, inhibiting the effectiveness of such assays. Finally, enhancer activity cannot be analyzed quantitatively, as MPRAs provide only binary information results (active/inactive) [105].

STARR-Seq
Self-transcribing active regulatory region sequencing (STARR-seq) is a method of directly discovering and quantitively assessing enhancer activity on a genome-wide scale. STARR-seq identifies active, chromatin-masked, and dormant enhancers by assaying enhancer activity of genomic fragments episomally. Briefly, genomic DNA is fragmented, and linkers are added to fragment ends. This library of fragments is then cloned into a vector downstream of a core promoter, the vector library is transfected into cells, and after 24 h, RNA is harvested, and a cDNA library generated. Genomic DNA is simultaneously harvested to control for differential transfection efficiencies. Cloned fragments with enhancer activity will drive expression of themselves and resulting sequence output will both identify enhancers and quantify their activity. This method allows for the simultaneous screening of the entire genome for enhancer activity [95,102,106]. There are a number of available methods for analysis of STARR-seq data and identification of enhancer peaks [107,108]. Drawbacks of STARR-seq are twofold; the first being that many enhancers are "context dependent", meaning that their position in the genome is important, and the STARR-seq approach removes DNA fragments from their genomic context. Enhancers may interact with other nearby regulatory elements, or distal regulatory elements that are brought to interact with an enhancer through changes in the chromatin structure. The second being that this method discovers all enhancers within the tested DNA, making it difficult to determine which enhancers are relevant to a condition [105]. Despite these limitations, this method, capitalizing on next-generation sequencing approaches to comprehensively query enhancer activity on a genome-wide scale, generates a comprehensive catalog of an organism's enhancers.
STARR-seq has been used in Drosophila to comprehensively characterize and compare transcriptional enhancers across five closely-related species [95]. This seminal work concludes that there is a good degree of evolutionary conservation in enhancer activity, as well as frequent gains in enhancer function since divergence from the common ancestor. Work in An. coluzzii has examined the impact of naturally-segregating genetic variation in a small number of enhancers with potential roles in mosquito development, immunity, and insecticide resistance [109].

Indirect Methods for Enhancer Discovery
Indirect approaches to discovering cis-regulatory elements operate through the detection of open chromatin. These indirect methods are predicated on the knowledge that active regulatory elements exist within open chromatin. A variety of methods now exist that either tag or remove open DNA (or both tag and remove), allowing this portion of the genomic DNA to be selectively sequenced. As chromatin can be open both constitutively and conditionally, an indirect approach to regulatory element discovery is valuable in the detection of genomic structural changes that may impact gene expression. This section explores a number of indirect methods for the detection of enhancers/cis-regulatory elements.

ATAC-Seq (Single Cell Capable)
Assay for transposase-accessible chromatin using sequencing (ATAC-seq) is one method of detecting open chromatin and enhancers. This method employs enzymatic manipulation of DNA, specifically using a hyperactive Tn5 transposase to cut and "tag" open DNA with adaptor sequences [110]. The method has gained increasing popularity due to its need for small amounts of input DNA and shorter experimental run time (less than three days) [111]. As mentioned previously, the open DNA sequence is bound to a hyperactive derivative of Tn5 which is flanked by 19 bp sequences called mosaic ends (MEs). These MEs are specific to the sequence around the insertion-site DNA. The open DNA is subsequently cut by Tn5 transposase derivatives, and the MEs remain attached, tagging the cut DNA with a specific sequence [112]. This "tagmented" DNA is subsequently purified, amplified, and sequenced. ATAC-seq is a method that can be done at the scale of the single cell, which affords very fine-scale characterization (see review of the single cell approach here [113]). While ATAC-seq has become a more commonly-used approach in the last five years, the method has its own set of drawbacks, including the amplification of non-nuclear, particularly mitochondrial DNA [110,114]. Methods for analysis of ATAC-seq data are evolving, and a recent publication provides an up-to-date review of current methods, including quality control steps, peak identification, and identification of differential peaks [115].
Use of ATAC-seq in mosquitoes is limited to two studies on Ae. aegypti and An. gambiae [80,116]. In Ae. aegypti, the method was adapted from the original protocol published in 2013 [111,117] for use on Ae. aegypti brains to map CRE at predicted transcription start sites in the updated genome, AaegL5 [80]. In An. gambiae, genome-wide profiling of chromatin accessibility was done using the salivary glands and midguts of Plasmodiuminfected females. ATAC-seq was used in combination with RNA-seq and ChIP-seq data to demonstrate that chromatin accessibility was greatest in promoter regions and introns, and that these open regions also correlated with tissue-specific gene expression. The study identified potentially important regulatory regions within the An. gambiae genome [116].

ChIP-Seq
Chromatin immunoprecipitation (ChIP-seq) is another method of detecting open chromatin, and thereby indirectly cataloging CREs. ChIP-seq starts with resident proteins being crosslinked to the DNA. DNA is then sheared using sonication, incubated with antibodies, and immunoprecipitated. Immunoprecipitated DNA is then amplified and sequenced [118]. ChIP-seq is the oldest method for cis-regulatory element/enhancer discovery, and has reliably produced high-resolution results [118]. However, ChIP-seq can be difficult to use for some laboratories or with some organisms, due to its high time cost, its need for large amounts of DNA, and the need for highly-specific antibodies that are not always readily available [114,119].
In An. gambiae, ChIP-seq combined with RNA-seq have been used to study the chromatin modifications accompanying Plasmodium infection [120]. Most of this work used histone modification markers known to associate with promoters. A comprehensive look at enhancers would require the use of chromatin marks, such as H 3 K 4 me1, known to be enhancer-associated [120]. Earlier work in An. gambiae cemented a correlation between the histone modification marks, H 3 K 27 ac and H 3 K 27 me3, and increased/decreased gene expression, respectively [121]. Work done by Lukyanchikova et al. examined the 3D architecture of five Anopheline mosquito species [122]. Much of the analysis was performed using Hi-C to examine new looping interactions in the 3D genome. Chromatin loops had been previously associated with the polycomb group of proteins that largely function in maintaining cell positional identity. To examine this association, ChIP-seq was performed on Anopheles atroparvus, revealing that some of the looping structures were anchored in H 3 K 27 me3-enriched silencing regions. ChIP-seq has also been used in Cu. pipiens to determine 72 new targets of the forkhead transcription factor (FOXO) [123]. Two important signaling pathways appear to hinge on the presence of FOXO in order to transition adult Cu. pipiens mosquitoes into their overwinter diapause. Discovery of these new target genes represent an expansion in the previous knowledge of FOXO interactions.

DNase-Seq (Single Cell Capable)
DNase-seq is the second oldest method of indirect regulatory element discovery and is another reliable, well-documented method [124]. DNase-seq finds its roots in DNasefootprinting, a technique that similarly uses DNase I to digest DNA but culminates in a gel electrophoresis step, relying on DNA-fragment sizes to report the "footprint" of binding proteins [125]. DNase-seq advances the merits of its predecessor by providing an even more detailed initial look at chromatin structure where there may have been no previous understanding. These advances lie in the coupling of DNase-footprinting with high-throughput sequencing approaches [126]. The method uses slightly less DNA than ChIP-seq, but there is a risk of enzymatic cleavage bias that may skew the results [114]. DNase-seq begins with a Dnase I chromatin digestion where open DNA is selectively excised. This cleavage reaction is stopped when it is loaded onto a low-melt agarose gel and subjected to electrophoresis. The desired bands are removed from the gel, and the open DNA within them is amplified and sequenced by high-throughput sequencing technologies [114,126,127]. There are no published uses of DNase-seq in mosquito disease vectors, and only one published use of DNAse-footprinting [128].

FAIRE-Seq
Formaldehyde-assisted isolation of regulatory elements sequencing (FAIRE-seq) is a method of indirect regulatory element discovery commonly used due to its straightforward application, low cleavage bias, and its ability to be applied to many different cell types [126]. There are three basic steps to FAIRE-seq: first, the DNA, similar to in ChIP-seq, is crosslinked and sheared; second, the non-crosslinked DNA is phenol-chloroformextracted and third, this DNA is then amplified and sequenced [129,130]. Though the procedure is straightforward, FAIRE-seq generates a low signal-to-noise ratio that can make data processing difficult. Additionally, the variable length of time required for the formaldehyde fixation step can make it hard to plan experiment time effectively [126,129,131].
FAIRE-seq was used to generate a genome-wide map of regulatory elements in Ae. aegypti. Very interestingly, of the large number of single nucleotide polymorphisms identified in mosquito strains susceptible and resistant to dengue virus, more than a quarter of these SNPs overlap with regulatory peaks, suggesting that variation in regulatory sequences can contribution to variability in the susceptibility to dengue infection [131].
In An. gambiae, FAIRE-seq was used in a study of cis-regulatory elements involved in innate immune function. Sequences for new CREs were discovered and may prove useful in predicting protein-protein interactions in the An. gambiae immune responses [132].

MNase-Seq (Single Cell Capable)
MNase-seq, while also an indirect method of enhancer identification, is different in that it does not involve fragmenting open chromatin, but rather is designed to cleave and degrade internucleosomal DNA [110]. Micrococcal nuclease (MNase) is an endoexonuclease derived from Staphylococcus aureus, and its first use, to determine chromatin structure, dates back to 1975 [133]. The first instance of MNase paired with high-throughput sequencing, however, was in 2009 [134]. The technique begins with the digestion of genomic DNA with MNase to extract mononucleosomes. Following this, the DNA from the DNAprotein complexes are extracted and used to prepare a sequencing library. High-throughput sequencing then provides the genomic location of regulatory DNA-binding proteins in the genome [126]. One drawback of MNase-seq is the potentially variable digestion of MNase. Activity of the enzyme can be highly dependent on MNase concentration, making results potentially highly variable, even across experimental replicates [110,126,135]. There are no published uses of MNase-seq in mosquito disease vectors. MNase-seq has been used in the malaria parasite, Plasmodium falciparum to study the role of nucleosome positioning in the regulation of gene transcription [136]. In Drosophila, MNase-seq has been used to track changes in the nucleosome occupancy in response to immune stimulation [137].

NOMe-Seq
Nucleosome occupancy and methylome sequencing (NOMe-seq) is a more recent technique that is notable for its ability to both detect nucleosomes occupancy and methylation patterns in DNA [138]. NOMe-seq is performed by fixing cells and shearing the DNA to >1 kb fragments. The enzyme M.CviPI is then used to methylate unprotected GC dinucleotides in accessible DNA. Next, a bisulfite conversion is performed to convert all unmethylated cytosine into uracil. The prepared DNA is then purified, amplified, and sequenced [138]. NOMe-seq finds its one drawback in the need for specific DNA fragment sizes to prevent bias towards CpG islands [139]. There are no published uses of DNase-seq in mosquito disease vectors or in other insects. In the nearly 10 years since it was first introduced, there are only 19 published papers using this method (see Table 1).  a Date of first and total references were determined in Dec 2020 by searching PubMed using the protocol name in quotations, followed by the field term that searches the title and abstract for the protocol name: [TIAB] (i.e., "ATAC-seq" [TIAB]); b M.CviPI is an enzyme that methylates GpC dinucleotides; c One clock signifies <3 days of wet lab work; two clocks, 3 days of wet work, and three clocks, >3 days of wet work; d One mosquito indicates <100 individual mosquitoes used in a single experimental sample; two mosquitoes, 100 individuals, and three mosquitoes, >100 individuals; ND = not determined.   a Date of first and total references were determined in Dec 2020 by searching PubMed using the protocol name in quotations, followed by the field term that searches the title and abstract for the protocol name: [TIAB] (i.e., "ATAC-seq" [TIAB]); b M.CviPI is an enzyme that methylates GpC dinucleotides; c One clock signifies <3 days of wet lab work; two clocks, 3 days of wet work, and three clocks, >3 days of wet work; d One mosquito indicates <100 individual mosquitoes used in a single experimental sample; two mosquitoes, 100 individuals, and three mosquitoes, >100 individuals; ND = not determined.  a Date of first and total references were determined in Dec 2020 by searching PubMed using the protocol name in quotations, followed by the field term that searches the title and abstract for the protocol name: [TIAB] (i.e., "ATAC-seq" [TIAB]); b M.CviPI is an enzyme that methylates GpC dinucleotides; c One clock signifies <3 days of wet lab work; two clocks, 3 days of wet work, and three clocks, >3 days of wet work; d One mosquito indicates <100 individual mosquitoes used in a single experimental sample; two mosquitoes, 100 individuals, and three mosquitoes, >100 individuals; ND = not determined.  Date of first and total references were determined in Dec 2020 by searching PubMed using the protocol name in quotations, followed by the field term that searches the title and abstract for the protocol name: [TIAB] (i.e., "ATAC-seq" [TIAB]); b M.CviPI is an enzyme that methylates GpC dinucleotides; c One clock signifies <3 days of wet lab work; two clocks, 3 days of wet work, and three clocks, >3 days of wet work; d One mosquito indicates <100 individual mosquitoes used in a single experimental sample; two mosquitoes, 100 individuals, and three mosquitoes, >100 individuals; ND = not determined.  a Date of first and total references were determined in Dec 2020 by searching PubMed using the protocol name in quotations, followed by the field term that searches the title and abstract for the protocol name: [TIAB] (i.e., "ATAC-seq" [TIAB]); b M.CviPI is an enzyme that methylates GpC dinucleotides; c One clock signifies <3 days of wet lab work; two clocks, 3 days of wet work, and three clocks, >3 days of wet work; d One mosquito indicates <100 individual mosquitoes used in a single experimental sample; two mosquitoes, 100 individuals, and three mosquitoes, >100 individuals; ND = not determined.     [131] Anopheles gambiae: [132] Low signal to noise ratio, variation in formaldehyde fixation step [126] MNase-seq  [131] Anopheles gambiae: [132] Low signal to noise ratio, variation in formaldehyde fixation step [126] None Does not capture conditional states, catalogs all enhancers [105] a Date of first and total references were determined in Dec 2020 by searching PubMed using the protocol name in quotations, followed by the field term that searches the title and abstract for the protocol name: [TIAB] (i.e., "ATAC-seq" [TIAB]); b M.CviPI is an enzyme that methylates GpC dinucleotides; c One clock signifies <3 days of wet lab work; two clocks, 3 days of wet work, and three clocks, >3 days of wet work; d One mosquito indicates <100 individual mosquitoes used in a single experimental sample; two mosquitoes, 100 individuals, and three mosquitoes, >100 individuals; ND = not determined.

Enhancer RNAs
Distinct from the enhancers detected by either indirect and direct methods are enhancer RNAs (eRNAs), which are non-coding RNA molecules transcribed from enhancer regions of the genome and comprises two main classes, 1D eRNAs and 2D eRNAs. These two classes of eRNAs differ in size, polyadenylation, and direction of transcription. 1D eRNAs are 3-4 kb in length, polyadenylated, and unidirectional, while 2D eRNAs are typically less than 2 kb, nonpolyadenylated, and bidirectional [142]. The functional role of eRNAs is not well characterized [143], but there does appear to be an association between eRNA expression and enhancer activity [144]. Previous data would suggest that eRNAs are able to self-transcribe and act as transcription factor complexes both in cis and trans, and may be necessary to help chromatin maintain its open state [145]. The process for discovering eRNAs generally includes two steps, the use of an indirect CRE discovery method, such as ChIP-seq, coupled with RNA-seq. Nothing is known about eRNAs in mosquitoes, but they likely play an important role in the regulation of gene expression in Drosophila [146].

Concluding Remarks
With continuously improving technological approaches, efforts to functionally characterize the non-coding genome in mosquito disease vectors are advancing. With regulatory elements such as miRNAs, lncRNAs, and enhancers identified and cataloged, efforts will shift to characterizing the functional consequence of genetic variation in these elements. A combination of direct and indirect experimental approaches will generate the most comprehensive picture of non-coding regulatory elements, their dynamic interactions with coding elements, and their impact on organism phenotype. Efforts such as the large-scale Ag1000 genomes sequencing project [147] will also aide in cataloging naturally-segregating variation in these non-coding regions. Funding: This work was funded by financial support to MMR from National Institutes of Health, NIAID #AI145999.