Insights into the Mechanism of Pre-mRNA Splicing of Tiny Introns from the Genome of a Giant Ciliate Stentor coeruleus

Stentor coeruleus is a ciliate known for its regenerative ability. Recent genome sequencing revealed that its spliceosomal introns are exceptionally small. We wondered whether the multimegadalton spliceosome has any unique characteristics for removal of the tiny introns. First, we analyzed intron features and identified spliceosomal RNA/protein components. We found that all snRNAs are present, whereas many proteins are conserved but slightly reduced in size. Some regulators, such as Serine/Arginine-rich proteins, were notably undetected. Interestingly, while most parts of spliceosomal proteins, including Prp8's positively charged catalytic cavity, are conserved, regions of branching factors projecting to the active site are not. We conjecture that steric-clash avoidance between spliceosomal proteins and a sharply looped lariat might occur, and that splicing regulation may differ from that of other species.


Introduction
Stentor coeruleus is a giant single-celled ciliate that lives in freshwater environments worldwide. The organism is relatively large (up to 2 mm in length) and has a clear anterior-posterior axis, detailed cortical patterning, and an ability to repair itself even after sustaining large wounds in the plasma membrane [1,2]. These characteristics have long made S. coeruleus a model organism for studying unicellular regeneration, wound response, and cell repair mechanisms [3,4]. Although the organism has been studied for several decades, its genome and transcriptome have only recently been sequenced [4,5]. Because the genome sequence reveals that Stentor uses the standard genetic code, unlike many ciliates, it has been proposed that the protist may have branched from the others before ciliate-specific genetic codes arose [4]. Moreover, despite its large cell size, its spliceosomal introns are extremely small, merely 15 to 16 nucleotides (nt) long. Given that the median intron length in protein-coding genes of the budding yeast Saccharomyces cerevisiae and of humans is approximately 148 and >1300 nucleotides, respectively [6,7], the exceptionally small introns of S. coeruleus raise an important question about whether the protist may have an unusual mechanism of pre-mRNA splicing [2].
Pre-mRNA splicing is an RNA processing step that removes non-coding introns from a precursor transcript [8-10]. The process is catalyzed by the spliceosome, a megadalton ribonucleoprotein complex that comprises five small nuclear ribonucleoproteins (snRNPs; U1, U2, U4, U5, and U6) and many non-snRNP-associated proteins [9-11]. To function, the spliceosome has to be assembled de novo on each intron in a stepwise manner [9-11]. First, the U1 and U2 snRNPs recognize the 5′ splice site (SS) and the branchpoint (BP) sequence of the intron, respectively, to form a pre-spliceosomal complex. The pre-assembled U4/U6.U5 tri-snRNP then joins to form a pre-catalytic spliceosome. Subsequently, an extensive structural rearrangement of both protein and snRNA components of the spliceosome occurs to activate the snRNPs [9,10,12]. Upon activation, the spliceosome coordinates the two transesterification reactions required for intron removal and the completion of the splicing cycle [9-11,13]. To ensure splicing fidelity, spliceosome assembly and activation are tightly controlled by several ATP-dependent RNA helicases as well as by base-pairing interactions between the snRNAs of the spliceosome and the intronic sequences of the pre-mRNA [14,15].
Given that the remarkably small introns of S. coeruleus are recognized and spliced out by the relatively large spliceosome, we wondered how splicing in this species is regulated and therefore analyzed intronic sequences and spliceosomal components at the genomic level. Here, we show several intriguing features of Stentor introns. Moreover, we computationally identify snRNA and protein components of the Stentor spliceosome. We also propose a base-pairing scheme of the spliceosomal active site and its interaction with the intronic substrate. Intriguingly, although most spliceosomal protein homologs are present and similar to their vertebrate counterparts, most spliceosomal proteins are reduced in size and certain regions of branching factors at the active site are non-conserved. We conjecture that an avoidance of steric clashes between spliceosomal components and the looped structure of the intron lariat may take place in this species, and we hypothesize that the regulation of pre-mRNA splicing of Stentor introns may be distinct from that of other species owing to the extraordinarily small size of its introns.

Features of Stentor Introns
To gain more insight into the splicing mechanism in Stentor coeruleus, we first analyzed features of intronic sequences using reported genomes and databases [4,16]. In this species, 8806 introns were annotated [4,16]. Of these, 8173 (92.81%) were 15 nucleotides (nt) long, while the rest (633 intronic sequences; 7.19%) were 16 nt long. From the annotation in general feature format (GFF) and the genomic sequence of the ciliate, we then extracted the sequence of each annotated intron for more detailed analysis. According to the frequency plots [17], most of the introns start with GU and end with AG, the 5′ splice site (SS) and 3′ SS dinucleotides most commonly observed in introns of the major spliceosome in many eukaryotic species. Although the branchpoint region of Stentor introns does not show a strong consensus, the branchpoint adenosine (BP-A) of 15 nt-long and 16 nt-long introns almost invariably resides at the 10th (8005 introns; 90.90% of all introns) and 11th position (442 introns; 5.02% of all introns), respectively (Figure 1A). In contrast, approximately 2% of introns in each case (168 introns or 1.91% for 15 nt-long introns and 191 introns or 2.17% for 16 nt-long introns) harbor the BP-A at other positions. Strikingly, introns of the ciliate are enriched in adenine (A) and uracil (U); the combined percentage of the two nucleotides (AU content) is as high as 75.63% (Figure 1B). This analysis suggests that most Stentor introns, when spliced, would yield an AU-rich 10 to 11 nt-long circular lariat with a 5 nt-long 3′ tail. It is interesting to note that Stentor's 5′ exon seems to be unlike that of mammals, in which the last nucleotide of the exon bordering the splice donor site is usually a G (Figure 1A).
Next, we analyzed global features of Stentor introns in their biological context. We first asked how abundant intronic sequences are in S. coeruleus. We found that 28,064 genes lack introns, whereas 6218 genes contain them, indicating that only 18.14% of genes contain introns (Figure 1C). Among the intron-containing genes, most have only one intron, while a large proportion of the rest seems to have only a few (Figure 1C,D). Although some genes harbor more than eight introns, we are uncertain whether all of them are functional and actually spliced; further in vivo experiments are required. Interestingly, we observed that the presence of introns is correlated with a longer gene length; the median length of intron-containing genes is 1230 nucleotides, significantly longer than that of intron-less genes, which is only 939 nucleotides (Figure 1E). Moreover, we observed that introns have a positional bias toward the 5′ end of each gene (Figure 1F); a similar phenomenon has been found in many eukaryotic species, including Saccharomyces cerevisiae [18,19]. Gene Ontology (GO) and pathway enrichment analyses of genes harboring introns showed significant enrichment of genes involved in catalytic activity, ion binding, organic cyclic compound binding, and several metabolic processes, suggesting potential physiological roles of gene regulation at the level of pre-mRNA splicing in S. coeruleus (Figures S1 and S2 and Table S1).
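The summary statistics described in this section (length distribution, GU-AG usage, AU content, and the lariat dimensions implied by the BP-A position) could be computed along the following lines. This is a minimal sketch, not the pipeline used in this study: the function names are hypothetical, the AU content is pooled over all nucleotides (an assumption, as the exact definition is not stated), and the second toy sequence is invented for illustration.

```python
from collections import Counter


def intron_summary(introns):
    """Summarise length distribution, GU-AG usage and pooled AU content."""
    seqs = [s.upper().replace("T", "U") for s in introns]
    if not seqs:
        raise ValueError("no intron sequences supplied")
    n = len(seqs)
    lengths = Counter(len(s) for s in seqs)
    total_nt = sum(len(s) for s in seqs)
    au_nt = sum(s.count("A") + s.count("U") for s in seqs)
    gu_ag = sum(1 for s in seqs if s.startswith("GU") and s.endswith("AG"))
    return {
        "n": n,
        "length_percent": {k: 100.0 * v / n for k, v in sorted(lengths.items())},
        "gu_ag_percent": 100.0 * gu_ag / n,
        # Assumption: AU content = (A + U) / all intronic nucleotides, pooled.
        "au_percent": 100.0 * au_nt / total_nt,
    }


def lariat_dimensions(intron_length, bp_position):
    """Return (loop, tail) lengths for a branchpoint at a 1-based position."""
    if not 1 <= bp_position <= intron_length:
        raise ValueError("branchpoint must lie inside the intron")
    # The loop spans the 5' end through the branchpoint; the tail is the rest.
    return bp_position, intron_length - bp_position


# Toy input: the first sequence is the most abundant intron reported in this
# paper; the second is an invented 16-nt example, not a real Stentor intron.
toy_introns = ["GUAAUUUUUAUAUAG", "GUAAUUUUUUAUUUAG"]
summary = intron_summary(toy_introns)
print(summary["length_percent"])
print(lariat_dimensions(15, 10), lariat_dimensions(16, 11))
```

With a BP-A at position 10 of a 15-nt intron or position 11 of a 16-nt intron, the helper returns a 10 or 11 nt loop with a 5 nt tail, matching the lariat dimensions stated above.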

Identification of Spliceosomal snRNAs in S. coeruleus
Though the introns of S. coeruleus are exceptionally small and may require unique spliceosomal regulation, little has been reported about the splicing machinery of the ciliate. Thus, we next aimed to identify all components of its spliceosome, including U-snRNAs and associated proteins. Since the introns of the protist harbor conventional GU-AG motifs (Figure 1A), we speculated that all major spliceosomal snRNAs might be present. Because searches for U-snRNA candidates based on primary DNA sequence similarity often fail owing to low sequence similarity, we used sequences of U-snRNAs from the Rfam database [20] to search for the corresponding U-snRNAs in the S. coeruleus genome using the cmbuild and cmsearch programs of the Infernal package [21]. As expected, we found all snRNAs of the major spliceosome (Figure S3A-E). Comparisons of sequences, Sm/Lsm binding sites, covariance models, and secondary structures showed that all predicted U-snRNAs of S. coeruleus are similar to their counterparts in other eukaryotes (Figure 2A-D). It is interesting to note that none of the snRNAs of the minor spliceosome (U11, U12, U4atac, and U6atac) was found using the above strategy. Because this is consistent with the fact that U12-type introns have not been reported in this species, we conjecture that pre-mRNA splicing in S. coeruleus primarily involves the major spliceosome and U2-type introns.
Intrigued by the above findings, we further analyzed features of all five spliceosomal U-snRNAs. First, Stentor U1 snRNA exhibits a conserved region, ACUUACCU, that potentially binds to the 5′ SS of introns; this sequence is identical to that of the Rfam model of U1 snRNA (Figure 2A). We also observed that the branchpoint-binding motif GUAGUA in the predicted U2 snRNA of Stentor is highly conserved (Figure 2B), suggesting that the intron recognition mechanism may be similar to that of other spliceosomes. Since the sequence of the putative U4 snRNA of S. coeruleus appears highly complementary to that of the putative U6 snRNA (Figure 2D) and the sequence and secondary structure of the U5 snRNA are highly conserved (Figure 2C), we conjecture that the formation of the snRNA backbone of the U4/U6.U5 tri-snRNP may not differ from that of other species. From these findings, we conclude that the spliceosome of S. coeruleus contains all five snRNAs, the sequences and features of which are most likely similar to their homologs in other eukaryotes.

Identification of Protein Components of Spliceosomal snRNPs in S. coeruleus
Next, we asked whether spliceosomal proteins are also conserved in S. coeruleus. To this end, we first obtained information on each protein from the UniProt database and used the UniProt proteome gene identifier (ID) as a query in the protein Basic Local Alignment Search Tool (BLASTP) against the non-redundant protein sequences (nr) database with an Expect (E)-value cut-off of 1 × 10⁻⁵. Because the assembly of the S. coeruleus genome is yet to be completed [4] and the proteome database of annotated proteins may still lack certain sequences, we employed the translated nucleotide BLAST (TBLASTN) operation mode against the whole-genome shotgun contigs (wgs) of S. coeruleus with an E-value cut-off of 1 × 10⁻⁵ if the initial BLASTP failed to identify any significant hit (Table 1). First, we analyzed Sm and Sm-like (Lsm) proteins, the core proteins that associate with the U1, U2, U4, and U5 snRNAs and with the U6 snRNA, respectively. Consistent with the presence of all five U-snRNAs and the conserved Sm/Lsm binding sites (Figure 2), all seven Sm (Sm B, D1, D2, D3, E, F, and G) and seven Lsm (Lsm2 to Lsm8) proteins were identified by BLAST (Table 1), suggesting that Sm/Lsm hetero-heptameric ring complexes are most likely formed and possibly interact with the corresponding U-snRNAs as in other eukaryotes.
We next investigated whether U1, U2, and U4/U6.U5 tri-snRNP-specific spliceosomal proteins are present in S. coeruleus. For the U1 snRNP, our BLAST analysis showed that Nam8/TIA1, Prp39/PRPF39, and all three core U1-specific proteins, Mud1/SNRPA (U1A), Yhc1/SNRPC (U1C), and Snp1/SNRNP70 (U1-70k), are conserved, while the more peripheral U1 snRNP components were not detected by either BLASTP or TBLASTN (Table 1). Strikingly, however, all protein components of the U2 snRNP and U2 snRNP-associated complexes were identified except U2SURP. Since the sequences and predicted secondary structures of the U1 and U2 snRNAs are conserved (Figure 2) and the two undetected proteins are likely vertebrate-specific factors, CHERP and U2SURP (Table 1) [22], we conjecture that the core complexes of the U1 and U2 snRNPs as well as their associated factors are plausibly similar to those of other eukaryotic species. We subsequently investigated the presence of tri-snRNP proteins at the genomic level. Out of 18 proteins, all except 3 were identified by BLAST (Table 1), suggesting that the U4/U6.U5 tri-snRNP of S. coeruleus may share a similar structure and function with those of other species. We conclude from our findings that all five snRNP complexes of S. coeruleus may be formed and may function in a similar manner to the complexes in other species.
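The two-tier search described above (BLASTP first, TBLASTN only for queries without a significant BLASTP hit, both at an E-value cut-off of 1 × 10⁻⁵) can be sketched as a small post-processing step. The sketch assumes that hits were exported in the standard 12-column tabular BLAST format, which is not stated in the text; the query and subject identifiers below are invented placeholders, not results of this study.

```python
import csv
import io

# Column order of the default tabular BLAST output (assumed export format).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query: (subject, evalue)} keeping the lowest E-value per query."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < len(FIELDS) or row[0].startswith("#"):
            continue  # skip blank, comment, or malformed lines
        hit = dict(zip(FIELDS, row))
        evalue = float(hit["evalue"])
        if evalue > max_evalue:
            continue  # not significant at the chosen cut-off
        query = hit["qseqid"]
        if query not in best or evalue < best[query][1]:
            best[query] = (hit["sseqid"], evalue)
    return best


def classify_queries(queries, blastp_text, tblastn_text, max_evalue=1e-5):
    """Label each query by the first search tier that yields a significant hit."""
    protein_hits = best_hits(blastp_text, max_evalue)
    genome_hits = best_hits(tblastn_text, max_evalue)
    labels = {}
    for query in queries:
        if query in protein_hits:
            labels[query] = "BLASTP"
        elif query in genome_hits:
            labels[query] = "TBLASTN"
        else:
            labels[query] = "undetected"
    return labels


# Invented example rows (placeholders only).
blastp_rows = (
    "queryA\tprot_1\t73.9\t500\t130\t2\t1\t500\t1\t500\t0.0\t900\n"
    "queryB\tprot_2\t30.0\t80\t50\t2\t1\t80\t1\t80\t0.5\t30\n"
)
tblastn_rows = "queryC\tcontig_7\t45.0\t60\t30\t1\t1\t60\t100\t279\t2e-12\t70\n"
labels = classify_queries(["queryA", "queryB", "queryC"], blastp_rows, tblastn_rows)
print(labels)
```

A query labeled "undetected" here only means that no hit passed the cut-off in either tier; as noted later in the text, distant homologs may still be missed by this kind of search.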

Identification of Stentor Spliceosomal RNA Helicases and Other Non-snRNP Proteins Involving Spliceosome Assembly and Activation
Pre-mRNA splicing involves multistep assembly and activation of the spliceosome. During the early step of spliceosome assembly, the U1 and U2 snRNPs recognize the intronic sequences of a pre-mRNA and form a pre-spliceosomal complex known as the A complex [9,10]. Subsequently, the pre-assembled U4/U6.U5 tri-snRNP joins and forms the pre-catalytic spliceosome (B complex), which then undergoes ATP-dependent conformational rearrangement of its protein and snRNA components [9,10]. Remodeling of the B complex by the RNA-dependent helicase Brr2/SNRNP200 results in dissociation of the U1 and U4 snRNP complexes and thereafter the recruitment of several non-snRNP proteins, including the NineTeen Complex (NTC) and NTC-related proteins, to form the activated spliceosome (Bact complex) [9,10,23]. After further structural changes by the ATP-dependent RNA helicases Prp2/DHX16 and Prp16/DHX38 and dynamic association/dissociation of proteins, the catalytic spliceosome (C complex) is formed [9,10,14,15].
Given that spliceosome assembly and activation are highly dynamic and important for intron recognition, exon-intron arrangement, and the removal of introns, we next focused on identifying the spliceosomal proteins that are involved in these steps. First, we observed that all seven spliceosomal ATP-dependent RNA helicases (Prp2/DHX16, Prp5/DDX46, Prp16/DHX38, Prp22/DHX8, Prp28/DDX23, Prp43/DHX15, and Brr2/SNRNP200) and the GTPase Snu114/EFTUD2 were identified in the S. coeruleus genome, implying that the ciliate may also utilize ATP and GTP during the spliceosome assembly and activation steps (Table 1). Although the protist seems to lack certain components of splicing complexes, among them proteins recruited at the A complex stage, NTC and NTC-related proteins, and C complex and step II proteins, almost all proteins recruited at the B and Bact complex stages are present (Table 1).
Intriguingly, we observed that while orthologs of heterogeneous nuclear ribonucleoproteins (hnRNPs) were identified, all Serine/Arginine (SR)-rich splicing factors and SR-related proteins were absent from our search results. hnRNPs and SR-family proteins generally function as splicing repressors and activators, respectively [24-27]. Mechanistically, they interact with cis-elements in the transcripts and then recruit and/or stabilize components of the core spliceosome [26,28]. The lack of SR and SR-related proteins may be because S. coeruleus does not need to selectively promote the removal of specific introns. Additionally, exon skipping (ES) may not occur in the species. This is also suggested by the fact that the size of Stentor introns is almost constant at 15-16 nucleotides [4] (Figure 1A); if the ciliate were able to skip an exon, which is a long stretch of nucleotides, the sizes of its introns would be expected to vary more widely. The presence of hnRNPs, on the other hand, indicates that the repression of intron splicing may occur in this species. This could be a splicing-mediated mechanism to alter gene isoforms, thereby controlling gene expression in the ciliate. However, given that many known hnRNPs support a broad range of non-splicing biological functions, including mRNA stabilization, nuclear export, and transcriptional and translational regulation [29-32], we are uncertain whether the hnRNP orthologs found in S. coeruleus function exclusively in pre-mRNA splicing. Nevertheless, the absence of SR and SR-related proteins and the presence of hnRNPs may reflect the unique regulation of tiny-intron splicing and RNA metabolism as well as the relatively intron-poor nature of the protist.
It is important to note that a limitation of our work arises from the BLAST analysis, which could fail to detect protein factors with distant homology. Additionally, because we used known splicing factors as query sequences to search for Stentor orthologs, species-specific splicing factors, which may also exist and contribute to the splicing of the exceptionally tiny introns in the protist, could simply have been overlooked. Therefore, proteomic and biochemical analyses are required to ascertain the spliceosomal components of the ciliate. Nevertheless, our findings suggest not only that spliceosomal snRNA and protein components are largely conserved in Stentor, but also that many non-snRNP proteins are present. Our findings also suggest that the assembly and activation of the Stentor spliceosome might be conserved to a certain extent, but additional species-specific regulation, if any, could also take place.

A Model of RNA-RNA Interaction Network in Stentor Spliceosomal Active Site
In the fully assembled spliceosome, the U2 and U6 snRNAs extensively base-pair with each other and help position the two reacting groups of the first step of splicing, the 5′ SS and the branchpoint region, by base-pairing with the two sequences [9,10]. The base pairing between the U2 snRNA and the branchpoint region bulges the BP-A out from the RNA duplex [9,11]. The 2′ OH of the BP-A then carries out a nucleophilic attack on the 5′ SS, and thereby the 2′-5′ linkage between the BP-A and the first guanine nucleotide of the intron is formed. After this reaction, the 5′ exon is detached from the intron but remains held in the active site via interactions with the U5 snRNA and associated proteins [9-11,15]. Next, the second step of splicing involves a nucleophilic attack by the 3′ OH group of the 5′ exon on the phosphodiester bond at the 3′ SS. Ultimately, the spliceosome disassembles and releases the lariat intron [9,11].
Given that the introns of Stentor are exceptionally small, we next asked how the U2 and U6 snRNAs base-pair with each other and with the intronic sequences to position the 5′ SS and the BP-A. To this end, we analyzed the sequences of the relevant snRNAs of S. coeruleus, predicted the RNA-RNA interaction network, and compared it with that of the human spliceosome (Figure 3A,B and Figure S4). The active site of the human spliceosome is formed during the transition of the B complex to the Bact complex and stays unchanged during the two-step transesterification reactions [9-11]. In the catalytically active spliceosome, the U6 snRNA forms an intramolecular stem-loop (ISL) structure and helices I and II with the U2 snRNA [11] (Figure 3B). Although the sequences of the Stentor U2 and U6 snRNAs responsible for the formation of the U6 ISL and helices I and II deviate slightly from the human sequences (Figure 3A,B and Figure S1), secondary structure and base-pair predictions suggest that the Stentor snRNAs may also form the ISL and the two helices (Figure 3A). Moreover, the backbone nucleotides of the U6 catalytic triad (A48, G49, and C50) as well as the three nucleotides that form three consecutive triple base pairs with the triad (A41, G42, and U69) are invariantly conserved in Stentor (Figure 3A), suggesting that the folded RNA structure formed by the stacking of the three pairs of nucleotides might be present, too [11]. To position the pre-mRNA substrate in the active site of the human spliceosome, the 5′-end region of the intron needs to be positioned by base pairing with the ACAGAGA box of the U6 snRNA and with loop 1 of the U5 snRNA, which holds the 5′ exon (Figure 3B) [9,11]. Likewise, the branchpoint sequence of the intron pairs with the U2 snRNA to form a branch helix with the bulged BP-A (Figure 3B) [9-11]. In S. coeruleus, the ACAGAGA sequence of the U6 snRNA and the branchpoint recognition site of the U2 snRNA are highly conserved, implying that 5′ SS and branchpoint recognition might occur in a similar fashion to that of other species (Figure 3A).
[Figure 3: predicted RNA interaction network of the Stentor spliceosomal active site (A); the previously proposed RNA interaction network of the human spliceosome is shown as a reference (B) [11].]
Next, we asked how the base pairing between the pre-mRNA and the U2/U6.U5 snRNAs would form. To this end, we selected the most abundant intronic sequence, GUAAUUUUUAUAUAG, as a representative (127 occurrences in 8173 introns or 1.55%; the A at position 10 represents the putative BP-A) and predicted the RNA interaction network. Since the first three intron nucleotides (GUA) are stringently conserved in Stentor (Figure 1A), the sequence is likely able to form Watson-Crick base pairs with the U6 snRNA ACAGAGA box (Figure 3A). The branchpoint region, on the other hand, is enriched in U, a nucleotide that can potentially form not only a Watson-Crick U-A pair but also a wobble U-G pair as well as a non-canonical U•U base pair, which is ubiquitously found in non-coding RNAs [33]. Though further validation by genetic and biochemical experiments is required, our observation suggests that the conserved branchpoint recognition site of the U2 snRNA of S. coeruleus possibly base-pairs with the U-rich sequence of the intron branchpoint region (Figure 3A).
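The three pairing classes mentioned above (Watson-Crick, G-U wobble, and non-canonical U•U) can be illustrated with a small classifier for two antiparallel RNA strands. This is only a sketch under stated assumptions: it is not the prediction method used for Figure 3, it ignores bulges such as the BP-A, and the partner strands in the example are arbitrary and are not taken from the U2 or U6 snRNA.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def classify_pair(x, y):
    """Classify one nucleotide pair as 'WC', 'wobble', 'U.U' or 'none'."""
    pair = (x.upper(), y.upper())
    if pair in WATSON_CRICK:
        return "WC"
    if pair in WOBBLE:
        return "wobble"
    if pair == ("U", "U"):
        return "U.U"
    return "none"


def pair_antiparallel(strand_5to3, partner_5to3):
    """Pair two equal-length strands in antiparallel orientation.

    Position i of the first strand (read 5'->3') faces position i of the
    partner read 3'->5'. No gaps or bulges are modelled.
    """
    if len(strand_5to3) != len(partner_5to3):
        raise ValueError("strands must have equal length in this simple model")
    return [classify_pair(a, b)
            for a, b in zip(strand_5to3, reversed(partner_5to3))]


# Arbitrary illustrative partners (hypothetical, not snRNA sequences).
print(pair_antiparallel("GUA", "UAC"))  # intron 5' end vs. a perfect complement
print(pair_antiparallel("UUU", "GUA"))  # a U-rich stretch vs. a mixed partner
```

The second call shows why a U-rich branchpoint region is permissive: the same three uridines can be accommodated by A, U, or G on the opposite strand.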
The presence of all snRNAs and most of the core snRNP and non-snRNP proteins suggests that, to a certain degree, the regulation of pre-mRNA splicing in S. coeruleus might be conserved. Besides the network interactions between pre-mRNA and spliceosomal snRNAs, it has been demonstrated that spliceosomal proteins also play roles at the active site [9,11]. Particularly, the largest and highly conserved spliceosomal protein Prp8 occupies the central position in the catalytic core of the spliceosome [13]. We observed that the S. coeruleus Prp8 protein is 73.91% identical to the human homolog and the positively charged amino acids in the catalytic cavity of Prp8 share an even higher sequence identity of 94.69% with that of humans ( Figure 4A). The positively charged cavity of Prp8 at the spliceosomal active site is important because it is where the RNA triplex of U2 and U6 snRNAs and the intron lariat is located [11,13]. Strikingly, analysis of the electrostatic surface potential of the cavity showed a notable similarity between the catalytic cavities of Stentor and human spliceosomes ( Figure 4B). Taken together, we conjecture that the active site of Stentor spliceosome is most likely structurally and functionally similar to that of humans.
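For readers who wish to reproduce identity values of this kind from a pairwise alignment, a percent-identity calculation might look as follows. This is a hedged sketch: the text does not state which denominator was used for the reported percentages, so the choice here (aligned columns in which neither sequence has a gap) is an assumption, and the short sequences are invented.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are excluded from the denominator
        compared += 1
        if a.upper() == b.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Invented toy alignment (not Prp8): 4 identical residues in 5 ungapped columns.
identity = percent_identity("MKR-LA", "MKRQLV")
print(f"{identity:.2f}%")
```

Restricting the same calculation to the columns of selected residues, such as the positively charged residues of the catalytic cavity, gives a region-specific identity.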

Regions of Branching Factors Projecting to the Spliceosomal Active Site May Be Unique in Stentor
Structural analyses of human and yeast spliceosomes reveal that protein components of the RNP enzyme are located on the surface of one side of the splicing active site; this leaves the other side freely accessible to pre-mRNA molecules harboring introns with a vast range of lengths [13]. Our findings suggest that the spliceosome of S. coeruleus might be structurally and functionally similar to the spliceosome of humans. However, given that the introns of the protist are much smaller, so that the RNA lariat might form a sharp turn of 10 nt that potentially causes a steric clash with adjacent spliceosomal proteins, we wondered how the intron would fit at the active site. To this end, we focused on the three branching factors, Yju2/CCDC94, Cwc25/CCDC49, and Ntc30/ISY1, which are adjacent to the branch region and stabilize the docking of the U2/U6 branch helix [13]. Although the ciliate Yju2/CCDC94 is slightly smaller than its homologs in other species (Figure S5), its N-terminal domain, which is essential for viability and promotes branching, is highly conserved (Figure S6A). By contrast, while the N-terminal helix and three invariant tryptophan residues of Cwc25/CCDC49 (Trp12, Trp24, and Trp72 in CCDC49) are highly conserved in the protist, its N-terminal plug is distinct (Figure S6B). In the human spliceosome, the conserved plug with a glycine-rich motif (Gly2-Gly3-Gly4 in CCDC49) is located at the active site and penetrates a small cleft formed by the U2/branchpoint duplex and helix I of the U2/U6 duplex [11,13]. Interestingly, the Cwc25/CCDC49 protein of S. coeruleus lacks such a conserved motif (Figure S6B). Moreover, the N-terminus of Ntc30/ISY1, which projects into the active site of the spliceosome and contacts the phosphate backbone of the intron to promote branching in other eukaryotes, is strikingly non-conserved in Stentor (Figure S6C). Though we are uncertain how the active site of the S. coeruleus spliceosome is formed in three dimensions, the differences in these branching factors might directly and/or indirectly help avoid a steric clash with the looped structure of the RNA and contribute to lariat formation and branching of the tiny introns of the protist (Figure 5).

Computational Analyses of Features of Introns of S. coeruleus
S. coeruleus genome data were downloaded from the Stentor Genome Database at http://stentor.ciliate.org/ (accessed on 22 April 2020) [4,16]. Intronic sequences were extracted from the assembled genome using coordinates obtained from the general feature format (GFF) file [16] and bioinformatics tools in the Galaxy platform [34]. Sequence logos of intronic sequences were created using WebLogo [17]. Intronic features were computed and plotted using RStudio [35]. Gene Ontology (GO) IDs of genes containing an intron in S. coeruleus were retrieved from UniProt. The GO analysis was run on g:Profiler [36] using Tetrahymena thermophila and Paramecium tetraurelia, which, like S. coeruleus, are members of the phylum Ciliophora, as the organism input parameter. The top three enriched GO IDs in the molecular function (MF), biological process (BP) and cellular component (CP) categories were listed. To compare sequences of genes harboring introns at the genomic and transcriptomic levels, genomic DNA and mRNA sequences were retrieved from StentorDB [16] and from recently published RNA sequencing experiments performed with S. coeruleus [5], respectively. Multiple alignments were performed using Clustal Omega with default settings and visualized with the pyBoxShade program [37,38]. Statistical analysis was performed using a two-tailed unpaired Student's t-test in GraphPad Prism 9 software [39].
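The intron extraction step was performed with Galaxy tools; the equivalent logic can be sketched in a few lines for illustration. The sketch assumes that the GFF file contains explicit features of type "intron" (this may not hold for the actual annotation, in which case intron coordinates would first have to be derived from exon features), that coordinates are 1-based and inclusive as in the GFF convention, and that the genome is held in memory as a dictionary; the contig name and sequences are invented.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def extract_introns(gff_text, genome, feature_type="intron"):
    """Extract intron sequences (as RNA, 5'->3') from GFF lines and a genome dict."""
    introns = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # GFF columns: seqid, source, type, start, end, score, strand, phase, attributes
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        seq = genome[seqid][start - 1:end]  # 1-based inclusive -> Python slice
        if strand == "-":
            seq = reverse_complement(seq)
        introns.append(seq.upper().replace("T", "U"))
    return introns


# Invented toy contig carrying the same 15-nt intron on both strands.
intron_dna = "GTAATTTTTATATAG"
toy_genome = {"contig1": "CCC" + intron_dna + "GGG" + reverse_complement(intron_dna) + "AA"}
toy_gff = (
    "##gff-version 3\n"
    "contig1\ttoy\tintron\t4\t18\t.\t+\t.\tID=intron1\n"
    "contig1\ttoy\tintron\t22\t36\t.\t-\t.\tID=intron2\n"
    "contig1\ttoy\texon\t1\t3\t.\t+\t.\tID=exon1\n"
)
extracted = extract_introns(toy_gff, toy_genome)
print(extracted)
```

Handling the minus strand explicitly matters here, because the GU...AG boundaries used for the sequence logos are only visible once each intron is read in its transcribed orientation.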

Identification of Spliceosomal Proteins in the S. coeruleus Genome
To identify protein components of the spliceosome in the S. coeruleus genome, the information on each spliceosomal protein was obtained from the UniProt database [40]. Human spliceosomal proteins (listed in Table 1) were employed as queries in batch in a protein Basic Local Alignment Search Tool (BLASTP) search against the non-redundant (nr) database for S. coeruleus protein sequences on the National Center for Biotechnology Information (NCBI) website with an Expect (E)-value cut-off of 1 × 10⁻⁵ [41]. For proteins with no ortholog detected by the BLASTP search, the translated nucleotide BLAST (TBLASTN) operation mode was employed against whole-genome shotgun contigs (wgs) of S. coeruleus with an E-value cut-off of 1 × 10⁻⁵ [41]. We used yeast-specific spliceosomal proteins as queries instead when the information on human orthologs was unavailable. Multiple alignments were performed using Clustal Omega with default settings and visualized with the pyBoxShade program [37,38].
The putative catalytic center of Stentor Prp8 was predicted by HHpred in conjunction with MODELLER tools [42,43]. Structural comparison and electrostatic surface potential were carried out using UCSF ChimeraX Daily Build version (version 1.3; 7 September 2021) [44].

Conclusions
In this study, we analyzed features of the introns of S. coeruleus and identified snRNA and protein components of its spliceosome (Figure 6). We also propose a base-pairing model of the spliceosomal active site and discuss its association with an intron sequence. Although most spliceosomal proteins are conserved in the ciliate, their size is reduced. Moreover, the regions of certain branching factors that are adjacent to the spliceosomal active site are noticeably non-conserved, suggesting a unique mechanism of active-site arrangement, possibly for the avoidance of steric clashes between the intron lariat and spliceosomal components. Though there are limitations in our computational approach and further genetic and biochemical analyses are required, our findings provide an insight into the splicing of the tiny introns of S. coeruleus.
To date, it is unclear what environmental and/or intrinsic factors caused such a reduction in intron size in the ciliate. Additionally, there are still open questions regarding whether vertebrates, including humans, could splice such small introns and, if not, what the smallest possible size of an intron could be. Since small introns are unusual in the human genome and are most likely overlooked, the ability to splice them, either constitutively or in response to stress, could potentially increase the number of mRNA isoforms and thereby the diversity of proteins, some of which might be implicated in the development of human diseases. These intriguing possibilities remain to be explored.