Analysis of Polyadenylation Signal Usage with Full-Length Transcriptome in Spodoptera frugiperda (Lepidoptera: Noctuidae)

Simple Summary RNA polyadenylation is an important process in mRNA maturation. The process is controlled by various cis-acting elements surrounding the cleavage site, and their binding factors. Recently high sequencing technology especially full-length transcriptome provides a large amount of sequencing data for us to explore the variations of poly(A) signals, alternative polyadenylation (APA) in Spodoptera frugiperda. We studied 50,616 polyadenylation signals using full-length transcriptome and EST data. These data show that 51.64% of the 50,616 pA sites had the conserved AAUAAA hexamers, while 10.13% of pA sites had none of the AAUAAA-like hexamers. Among these genes, more than 64.76% have more than one pA site. Our results also support that APA plays a significant role in increasing transcriptome diversity and gene expression regulation in Spodoptera frugiperda. Our dataset was the first polyadenylation signal analysis in the Lepidoptera and would provide a theoretical basis for pest control. Abstract During the messenger RNA (mRNA) maturation process, RNA polyadenylation is a key step, and is coupled to the termination of transcription. Various cis-acting elements near the cleavage site and their binding factors would affect the process of polyadenylation, and AAUAAA, a highly conserved hexamer, was the most important polyadenylation signal (PAS). PAS usage is one of the critical modification determinants targeted at mRNA post-transcription. The full-length transcriptome has recently generated a massive amount of sequencing data, revealing poly(A) variation and alternative polyadenylation (APA) in Spodoptera frugiperda. We identified 50,616 polyadenylation signals in Spodoptera frugiperda via analysis of full-length transcriptome combined with expression Sequence Tags Technology (EST). The polyadenylation signal usage in Spodoptera frugiperda is conserved, and it is similar to that of flies and other animals. AAUAAA and AUUAAA are the most highly conserved polyadenylation signals of all polyadenylation signals we identified. Additionally, we found the U/GU-rich downstream sequence element (DSE) in the cleavage site. These results demonstrate that APA in Spodoptera frugiperda plays a significant role in root growth and development. This is the first polyadenylation signal usage analysis in agricultural pests, which can deepen our understanding of Spodoptera frugiperda and provide a theoretical basis for pest control.


Introduction
RNA polyadenylation is crucial to gene expression in eukaryotes. This mechanism is achieved by adding a 3 poly(A) tail to the 3 end of mRNA precursor cleavage site (CS). It not only affects the stability of mRNA, but also has great significance in post-transcriptional processes such as mRNA translation and trans-nuclear membrane transport [1][2][3][4][5]. It can be affected by a variety of factors, including self-carrying elements and abundant transacting protein factors. Most importantly, two core cis-elements are the sites for protein recognition. The first core cis-element is known as the polyadenylation site (PAS), which is a highly conserved AAUAAA hexamer motif and located 10-35 bp upstream of the 5 end of the cleavage site, and the other is a variable sequence rich in U/UG located in 15-30 bp downstream of the 3 end of the cleavage site [6][7][8][9]. During polyadenylation, they are bound by two protein complexes. Cleavage and polyadenylation specificity factors (CPSF) bind to protein complex via PAS, and U/UG-rich element is bound by cleavage stimulation factors (CstF) [10]. Algorithmic analysis of transcripts with poly(A) tails reveals that the hexamer motifs of PAS in humans, mice, Schmidtea mediterranea, and Drosophila melanogaster are not constant; 40-59% of PAS in mRNAs use the AAUAAA as motifs, 25-40% use mononucleotide variants of AAUAAA as motifs, 13-25% have no recognizable canonical motif [10][11][12][13].
Many genes have multiple polyadenylation (PA) sites, which adds uncertainty to mRNA cleavage. After variable polyadenylation (APA) of pre-mRNA, different types and lengths of mRNA subtypes will be formed, including mRNA with 3 noncoding regions (3 UTRs) of different lengths and transcripts with different coding sequences [6,14]. No matter which mRNA isoform is formed, it will have a great impact on the mRNA itself and protein translation. The difference in APA coding sequence will lead to the change of the C-terminal amino acid sequence of a protein, causing proteins to have different functions [11,[15][16][17]. Even if the amino acid sequence remains, differences in the sequence of UTR-APA will affect the stability of mRNA and change the binding ability of RNAbinding protein or microRNA [9,18]. Therefore, the cis-element of a specific gene with multiple PA sites is of great significance for the transcription of mRNA precursors, and different PA sites determine how the gene is regulated [19,20].
So far, the first insect that has been used by scientists for PAS analysis is Drosophila melanogaster [12,21,22]. Drosophila melanogaster polyadenylation signal has a diffuse A-rich region, including the AAUAAA motif, extending to 40 nt upstream from the cleavage site. In both mammals and the fly, there is a U-rich region just downstream of the cleavage site, corresponding to the CstF binding region. In addition, the distribution seems to be similar, with means of 17, 16, and 17 nt in human, mouse, and fly respectively. Furthermore, insects are the largest group of animals, and their populations have diversity [23]. PAS usage may vary in insects, so more insect species need to be identified and analyzed for PAS.
Spodoptera frugiperda belongs to the Lepidoptera (Noctuidae). It is an omnivorous and gluttonous agricultural pest which can damage more than 353 kinds of grain and cash crops in nearly 76 families [24]. Meanwhile, Spodoptera frugiperda originated from America, and spread to Africa in 2016 and China in 2019, causing great damage, especially to corn and rice [25,26]. With the rapid development of high-throughput sequencing technology in the past 10 years, its genome has been sequenced and more than 9 genome versions have been published [27][28][29][30][31][32][33][34]. However, the genome annotated work is not ideal. Difficulties arise due to insufficient genomic data on the 3 -end processing of annotated protein-coding genes [32].
Here, we utilize the systematic mapping of the raw data from Spodoptera frugiperda ESTs and full-length transcriptome data to the genomes and then analyze PAS and DSE signals with a general process. Our research analyzes the main traits of mRNA PAS signal and APA genes. Therefore, our poly(A) study in Spodoptera frugiperda will deepen the understanding of RNA post-transcriptional regulation and assist in the annotation of the Spodoptera frugiperda genome.

The Source of the Sequencing Data
Full-length transcriptome data of Spodoptera frugiperda were downloaded from the NCBI Sequence Read Archive database (https://www.ncbi.nlm.nih.gov/sra/, accessed date: 19 March 2022), and the accession numbers are SRR14569375.1, SRR14569378.1, and SRR14621792.1 [35]. The full-length transcriptome data from SRA database was used PacBio single-molecule real-time (SMRT) sequencing aimed at revealing the full-length transcriptome profiling of the FAW larval brain to obtain detoxification genes [35]. EST (Expressed Sequence Tag) data of Spodoptera frugiperda were searched and downloaded from NCBI (https://www.ncbi.nlm.nih.gov/nuccore/, accessed date: 1 August 2021), and sequence is presented in Dataset S1.

Identification of Poly(A) Sites
In this process, putative novel polyadenylation sites were identified. To achieve it, we first ascertained all sequencing reads starting or ending with adenine or thymine that had a minimum length of 10 nt, which followed Zhang [2] and Tian's [11] process. Next, we removed the terminal As or Ts. We then extracted 200 bp sequences upstream of the polyadenylation site. After the above steps, we aligned both EST data and full-length transcriptome data with the Spodoptera frugiperda genome (ZJU_Sfru_1.0 (NCBI project 6098571, GCA_011064685.1)) through bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2 /index.shtml, accessed date: 1 June 2018, Version 2.4.3, Selectable parameter: -p 10-scoremin L, −0.3, −0.3 -v 1). The reads were mapped uniquely to the genome in this phase and then utilized samtools (http://www.htslib.org/, Version 1.1.0, accessed date: 1 June 2018) to determine the specific base at which cleavage took place. To remove apparent cleavage sites potentially resulting from sequencing errors (As and Ts in the genome are erroneously considered as poly(A) tails), we deleted cleavage sites where the 10 bp downstream cleavage site genomic regions contained equal to or more than 9 As or Ts. Ultimately, we extracted the 100 bp upstream and downstream sequence for further research.

Polyadenylation Signal Search and Analysis
We analyzed single-nucleotide profiles using 100 bp upstream and 100 bp downstream of the cleavage site. Next, we did an analysis for all 6 hexamer motifs from 0-50 bp and the data of these motifs included there, along with the number of related genes. After completing this work, we got the top 18 motifs that had only one nucleotide difference. Then, the AATAAA and AATAAA with one bp variation frequency at the upstream of cleavage site were further examined. Additionally, the cleavage site downstream elements (such as U/GU rich element) were explored simultaneously at 100 bp downstream of the cleavage site. The perl script is presented in Script S1.

APA Analysis in the Spodoptera frugiperda Genome
To analyze the global distribution of APA in Spodoptera frugiperda, the gene annotation information of Spodoptera frugiperda was downloaded (http://v2.insect-genome.com/ Organism/715, accessed date: 1 March 2022) [32]. We compared the CS position with the gene location on the chromosome, and the gene containing two or more cleavage sites means the gene was potentially regulated by alternative polyadenylation. After this, the gene containing the PAS number was analyzed.

PAS Information Annotation
The PAS positions in the chromosome were obtained by extracting the mapped data of the poly(A) tail reads. Next, the gene number in each of the 32 chromosomes was then extracted using the genome annotation file, and PAS in different gene positions were also examined.

Data Analysis
Chi-square (χ 2 ) test was used to determine if the frequency of A, C, G, and U nucleotide around the polyadenylation cleavage site is significantly different from the expected A, C, G, and U frequency in Spodoptera frugiperda genome (GC content in Spodoptera frugiperda genome was 36.4%, G or C frequency should be 18.2%, A or U frequency should be 31.8%).

Identification of Polyadenylation Sites
The raw ESTs and full-length transcriptome for Spodoptera frugiperda contained a total number of 65,423 ESTs and 7,510,007,630 raw full-length transcriptome reads (Table 1). Next, we removed the low-quality reads which started or ended with As or Ts less than 10, and finally we obtained a sum of 35,853 EST sequences and 375,500,381 full-length reads with poly(A) tails (started or ended with As or Ts more than 10). These reads with poly(A) tails were then mapped to the Spodoptera frugiperda genome (ZJU_Sfru_1.0 (NCBI project 6098571, GCA_011064685.1)), yielding a total of 10,757,513 uniquely mapped cleavage sites (28,931 from 35,853 EST reads and 10,728,582 from 375,500,381 full-length reads). A total of 50,616 pA sites (Table S1, 10,245 from 28,931 EST reads and 40,371 from 10,728,582 fulllength reads) were discovered when internally primed cleavage sites were discarded (at least eight consecutive A in the genome aligned downstream of the cleavage site) and those sites that were closely separated by less than or equal to 20 bp were merged (Table 1, Figure 1B). There was only one cleavage site found in 35.34% of them and no heterogeneity was observed ( Figure 1A). Moreover, two (34.47%) and several (30.19%) alternative cleavage sites were discovered at the remaining pA sites ( Figure 1A).

Nucleotide Bias Close to the Cleavage Sites
To more clearly show the above core cis-elements, including 50,616 polyadenylation sites, we analyzed the single nucleotide profiles of 50,616 polyadenylation sites within 100 bp upstream and downstream of the corresponding cleavage sites. Under completely random conditions, A, U, C, and G, the frequency of four nucleotides appearing in every position of this region should be 25%. Considering the GC content in Spodoptera frugiperda genome was 36.4% (G or C frequency should be 18.2%), and A and U frequency were close to 63.6% (A or U frequency should be 31.8%). However, it was discovered that the frequency of U remained close to 30%, except for position 0 (the cleavage site), where it was found to be about 22.37% (no significant, p = 0.15). There were three frequency peaks for U and they were respectively located at position 3-27, −6 to −4, and −9 to −12. Coincidentally, the downstream core U/UG-rich sequences that CstF can particularly identify during polyadenylation coincided with the rise to the 3-27 U peaks. Except for the peaks corresponding to 0, −32 to −14, and −7 to −5, and a trough corresponding to 3-27, the probability of A at other positions remained around 25% and most regions were lower compared to the U. The A-rich region at position −32 to −14 corresponding to the hexamer element like core AAUAAA bound by CPSF. Furthermore, the CFI and CFII complex identified the cleavage site indicated by the A peak at position 0 (frequency 62.37%, p = 0.0001544). In comparison, C and G remained below 25% and close to 20%, except for positions −1 (C and G) and 4-5 (G only), in which they reached 25% or slightly higher (Figure 2A). The downstream U/GU-rich element corresponded to the G and U peaks at positions 4-5 and 3-27, respectively.   (Figures 2B and S1A). Moreover, AA-enriched positions were comparable to single nucleotide A in tendency, according to the dinucleotide profiles ( Figures 2B and S1A). In addition, the AA rich region showed an excessive expression with a frequency of >0.08 from position 23-100, while other dinucleotide such as AC, AG, and AU had no significant tendency ( Figure S1A). What is more, the region −1 to 1 occurred a sharp CA peak and proved that it was a cleavage site. However, the CC, CG, and CU had no such condition. Although the tendency of the GA was similar to that of the CA, it has a smaller proportion. Meanwhile, a sharp peak for UA appeared in the region of −1 to 1, and a slightly wider peak appeared in the region of −14 to −24. In accord with the downstream U/GU-rich element, the UU rich significantly occurred at regions from −4 to −13 and 10 to 26, with a focus on 14 to 22. Moreover, the frequency of the remaining dinucleotides at positions 1-100 was between 0.03 and 0.08, which appeared random (Figures 2B and S1A).

Sequence Motifs near the Cleavage Sites
A total of 50,616 pA sites previously identified were screened for all potential 6base hexamers 100 bp upstream or downstream of the cleavage sites in order to determine the different forms of polyadenylation signals present in Spodoptera frugiperda ( Figures 2C and S1B). A hexamer's scattered distribution at low frequency would be viewed as noise that happens at random. Hexamer signals, on the other hand, were defined as high-frequency peaks in a hexamer distribution. According to the above criteria, compared to the other hexamers, the motif AAUAAA showed the highest sharp peak at position −11 to −30. This phenomenon suggested that AAUAAA is the most important polyadenylation signal motif. Moreover, while the sharp peaks at positions −11 to −30 were present in the single-nucleotide variations of the motif AAUAAA, such as AUUAAA, ACUAAA, and AGUAAA, they were blunter and their values were not as high as those of AAUAAA. Except for these, the other variants, including UAUAAA, AAUAUA, AAUACA, CAUAAA, GAUAAA, and AAUAGA, also displayed peak values at position −11 to −30, which indicated that they were PAS as well, but had lower kurtosis than the above-mentioned hexamers. In addition, there was also a peak upstream of the cleavage site for the AAAUAA and AUAAAA, though they might be affected by the AAUAAA motif due to a shared portion. The variation UUUUUU could not be considered a real PAS since, unlike other hexamers, it exhibited no distinct peak and merely a dispersed distribution.

Usage Frequencies of Core Hexamers Polyadenylation Signal
In the 40 nt upstream of the 50,616 collected poly(A) sites, 89.87% of them contained at least one type of classical PAS hexamer ( Table 2). In addition, it showed that the usage frequency of the most fundamental motif AAUAAA was 51.64% in Spodoptera frugiperda, lower than that of humans and mice [11]. Except for the most common hexamer AAUAAA, we found that single-nucleotide variant AUUAAA was the second most commonly used hexamer (12.15%). The 14 PAS motifs in the table were unable to be found in the remaining 10.13% of the pA sites called none, showing a higher ratio than humans.

pA Sites Distribution in the Spodoptera frugiperda Genome
Firstly, the distribution of the 50,616 pA sites above and the 25,699 annotated proteincoding genes [32] on the 32 autosomes and sex chromosomes were compared ( Figure 3A). In the 32 chromosomes, the number of pA sites was much higher than that of genes. The pA sites to gene number ratio was from 1.49 to 2.49 in all chromosomes except for the Chr32 chromosome, and the Chr32 pA sites to gene number ratio was up to 9. In general, the number of pA sites was roughly correlated with that of genes ( Figure 3A). Chr32 was the W chromosome, 8 genes or transcripts and 72 pA sites were observed, which was significantly lower than other chromosomes in genes and pA sites number. Then, we also found that 69.26% (35,055 out of 50,616) of the pA sites were distributed in 15,218 (out of the 25,699) annotated protein coding genes and 30.74% (15,561) were located in the regions of non-annotated genes, respectively ( Figure 3B,C). Among the 35,055 pA sites, which were associated with 15,218 genes, most of them (number and percentage of 79.07%) were found in 3 UTR s ( Figure 3B). The second and third pA site-containing sections, with 15.07% and 4.99% pA sites, respectively, were home to the introns and exons of these genes. We also found 0.87% pA sites in the 5 UTR region of these genes ( Figure 3B). Among the 25,699 annotated transcripts mapped with pA sites, excluding unknown genes, 70.41% of them have two or more alternative pA sites ( Figure 3C), resulting in 3 encoded or UTR sequences of variable length.

GO Analysis and KEGG Analysis of Alternative Polyadenylation Gene in Spodoptera frugiperda
In order to better analyze the biological functions and distribution characteristics of the APA gene of Spodoptera frugiperda, we used bioinformatics methods to conduct GO analysis and KEGG analysis on the screened genes with two or more PA sites (accounting for 70.41%) (Figures 4, S2 and S3). In the column of biological processes, the number of genes enriched by three biological processes, namely, metabolic process, cellular process, and single organization process, is the most prominent, with more than 600 ( Figure 4A). In the cell localization column, APA genes are mainly concentrated in five parts: membrane, cell, cell part, membrane part, and macromolecular complex, of which the number of genes concentrated in the membrane is the largest ( Figure 4A). In the column of molecular function, binding and catalytic activity enrich the most genes. It can be seen that the protein encoded by the APA gene often performs specific biological functions such as binding protein and catalytic protein ( Figure 4A).
In KEGG analysis, we selected the top 20 genes with the highest enrichment factor as a reference. The results showed that the number of genes enriched in endocytosis, ribosome, ribosome biology in eukaryotes and other pathways was the largest, but the degree of enrichment was relatively low ( Figure 4B). The phylalanine, tyrosine, and tryptophan biosynthesis pathways were the most enriched ( Figure 4B). It can be seen from the results that most of the 20 KEGG pathways are metabolic pathways, indicating that the protein encoded by the APA gene is of great significance for cell metabolism, which is consistent with the results of the analysis.
Through GO analysis and KEGG analysis, it can be seen that the protein encoded by the APA gene bears important biological functions and strictly controls the process of cell metabolism. The in-depth study of APA mechanisms has a guiding significance for regulating metabolism and changing traits.

Different Alternative Polyadenylation Types
Variable tailing can be divided into many types according to different positions of genes. Some will change the protein sequence, and some will not change the protein sequence but will change the RNA regulatory elements. We selected four genes (gene number: LOC118274515, LOC118275636, LOC118264954, LOC118275420) to explore the regulation of variable tailing on genes. LOC118274515, a short chain specific acyl CoA dehydrogenase, has a variable tailing site in the 5 UTR, primarily one in the 3 UTR. If the variable tailing of 5 UTR is mainly used, this gene will only transcribe some non-coding transcripts ( Figure 5A). LOC118275636 was most likely peroxisomal acyl-coenzyme A oxidase 1, whose first pA site was in the ORF, the second in the 3 UTR ( Figure 5B). If it mainly uses the first tailing site, the protein sequence will be shortened, which may change the function of the protein. LOC118264954 was found to be cadherin-related tumor suppressorlike, whose first pA site is in an intron, and another one is in 3 UTR ( Figure 5C). If it mainly uses the first tailing site, the protein sequence will be shortened, which may change the function of the protein. LOC118275420 was Acyl-CoA dehydrogenase ( Figure 5D). Both pA sites were in 3 UTR. Its variable tailing does not lead to different protein sequences, but it contains many regulatory elements, such as miRNA recognition sites, in order to alter the length of 3 UTR.

Discussion
With the rapid development of high-throughput sequencing technology, researchers have obtained massive amounts of sequencing data, including genomic and transcriptome, which allows researchers to explore the regulation mechanism of genes. In this study, a complete collection of the poly(A) site was performed, followed by a preliminary study of its polyadenylation signal, polyadenylation, and gene regulation properties at the APA level. This study lays the foundation for further research on the RNA modification of polyadenylation and the genetic regulation of APA in Spodoptera frugiperda.
As described previously [35], the cleavage and polyadenylation sites of Spodoptera frugiperda were identified to perform the first analysis of PAS usage. To achieve it, over 10 million individual tags (28,931) and reads (10,728,582) were mapped, and all of the 50,516 CSs were identified. Although the genome of Spodoptera frugiperda has been annotated, the various isoforms produced by APA could not be distinguished by it. A major finding is that APA is widely used in Spodoptera frugiperda.
There are currently many methods for identifying cleavage and polyadenylation sites [6,[36][37][38], but these require specific sequencing protocols. In previous studies, a large amount of transcriptome and EST data has been produced. Therefore, we could conduct further analysis with these data and obtain more useful information about polyadenylation. Tian's study, for example, found 29,823 poly(A) sites in humans and 16,282 poly(A) sites in mouse [11]. We identified 25,699 poly(A) sites in Spodoptera frugiperda, which is a reasonable number based on the number of genes in the genome. This study found that in Spodoptera frugiperda, many mRNAs have different gene isoforms, because the location of the cleavage is different. The results showed that APA regulation has a significant impact on regulating gene expression of Spodoptera frugiperda.
APA has an effect on the fate of mRNA, and longer UTRs are unfavorable for gene expression because APA can provide more sites for miRNA-binding [39] RNA-binding recognition proteins. The result shows that the distribution of PAS variants was consistent with that of human and fly [12]. The PAS usage frequency of AAUAAA/AUUAAA is similar to that of other PAS variants, and all of these sites are used for CPSF recognition. This indicates that in different animals the polyadenylation process is relatively conserved. Interestingly, the previously reported PAS sequences upstream of the cleavage sites were not found in 10.13% of transcripts of Spodoptera frugiperda [10]. The lost signals indicate there are alternative PAS sequences or an alternative mechanism that has not been identified. Compared with Beaudiong and Tian's study [10,11], this result shows the higher frequency of unidentified PAS and the lower ratio of known PAS variants. However, such a result can be affected by more than one factor because the basis of this analysis is RNA-seq data, whose quality is essential for PAS analysis. What is more, this study utilized more data than the previous studies. In addition, the available scale of PAS sequencing data of human and mouse is much richer.
Through GO analysis and KEGG analysis of the APA gene of Spodoptera frugiperda, it can be seen that the APA gene-coding protein is widely distributed in cells and has many mediated pathways. The results of GO analysis showed that the encoded proteins were mostly used to perform two biological functions: binding and catalysis. KEGG results showed that the APA gene was enriched in various pathways related to cell metabolism. Through research, it can be found that the APA gene has a wide range of functions. The APA mechanism leads to the same gene producing different mRNA under different conditions, which is conducive to the normal progress of life activities and adaptation to environmental changes.
The research has significant future directions. Although we identified 50,616 poly(A) sites, only 59.22% of them could be mapped to genes. The insufficient and unclear annotation of the Spodoptera frugiperda genome data may be the reason for this phenomenon. Another reason may be that the full-length transcriptome data only comes from the brain tissue of the larva, and there is no tissue source of other instars and other tissues, which leads to some transcripts not being expressed in the brain tissue, and thus the polyadenylation site cannot be found. In a specific tissue, the usage of polyadenylation sites may show certain rules. For example, short transcripts with early pAs are common in testis, as shown in Drosophila melanogaster [21], which is also the reason why the poly(A)sites we identified are not comprehensive. Moreover, the usage of tailing sites in brain tissue should be explored in future studies. Therefore, more transcriptome data information of different Spodoptera frugiperda tissues is needed to solve these problems.

Conclusions
In this study, we first identified 50,616 pA sites on the Spodoptera frugiperda and analyzed them. We also analyzed the distribution and frequency of nucleotides and motifs within 100 bp upstream and downstream of 50,616 pA sites to identify the core cis-acting elements that define these pA sites. The results showed that 51.64% of the 50,616 pA sites had the conserved AAUAAA hexamers, while 10.13% of pA sites had none of the AAUAAA-like hexamers. Among these genes, more than 64.76% have more than one pA site. The PAS data provided by this paper will help to improve the annotation of Spodoptera frugiperda genome and facilitate the research of Spodoptera frugiperda. In addition, the results of this experiment can be used as a basis for the comparative analysis of polyadenylation signal between different species.