Genesis of non-coding RNA genes: a sequence connection with protein genes separated by evolutionary time

A small phylogenetically conserved sequence of 11,231 bp, termed FAM247, is repeated in human chromosome 22 by segmental duplications. This sequence forms part of diverse genes that span evolutionary time: the protein genes are the earliest, as they are present in zebrafish and/or mouse genomes, and the long non-coding RNA genes and pseudogenes are the most recent, as they appear to be present only in the human genome. We propose that the conserved sequence provides a nucleation site for new gene development at evolutionarily conserved chromosomal loci where the FAM247 sequences reside. The FAM247 sequence also carries information in its open reading frames that provides protein exon amino acid sequences; one exon plays an integral role in immune system regulation, specifically in the function of ubiquitin specific protease (USP18) in the regulation of interferon. An analysis of this multifaceted sequence and the genesis of genes that contain it is presented.


Introduction
The genesis of genes has been a major topic of interest for several decades [1,2]. One mechanism of gene origin is the duplication of existing genes [1,3]. This is considered one of the major processes of new gene formation, but gene birth from non-coding DNA via de novo gene formation has also been shown to be prevalent [4-7] and to contribute significantly to new gene formation [4,7].
In this treatise we present and analyze gene development by an evolutionarily conserved ancestral sequence. This is a repeat element, previously termed clincRNA [8] and now termed FAM247. The long intergenic non-coding RNA (lincRNA) FAM247A gene sequence has been used as a guide to find homologous sequences, and hereafter FAM247 is used in place of FAM247A. We propose that FAM247 carries information to form nucleation sites for gene development; this is exemplified by the maturation of pseudogenes through addition of chromosomal sequences at specific sites on FAM247. FAM247 forms part of non-coding RNA genes that appear to be human specific. The sequence is also found in the protein genes USP18 (ubiquitin specific protease) and GGT5 (gamma-glutamyltransferase). Both of these genes date back in evolutionary time, USP18 over 350 million years ago (MYA) and GGT5 over 90 MYA. Thus, the FAM247 sequence has formed a part of genes through much of vertebrate evolution.

Background on conserved linked sequences
FAM247 is present in different segmental duplications or low copy repeats (LCR22) in human chromosome 22 (chr22) as part of phylogenetically conserved linked gene sequences [8]. Figure 1 is a representation of these linked sequences and shows conserved nearest neighbor sequence signatures found in humans. The linked gene sequences are repeated in chr22 and generate gene families. The signatures are also representative of ancestral primate linked sequences, e.g., the sequence arrangement in Figure 1b is present in the Rhesus Old World monkey (Macaca mulatta), where FAM247 and spacer sequences are linked to GGT1 on chr10. The spacer sequence (3953 bp) depicted in Figure 1 is also evolutionarily conserved. It is present in the Pan troglodytes (chimpanzee), Papio anubis (olive baboon), Pongo abelii (Sumatran orangutan), and Macaca mulatta (Rhesus monkey) genomes, but it does not encode genes or form part of genes [8]. That it is evolutionarily conserved indicates that it may have a function. FAM247 is the common denominator in Figure 1a and 1b. In Figure 1c, the FAM247 sequence (green highlight) is embedded in the USP18 gene, as described in [8]. Table 1 contains a list of human gene families that are found in the repeat units shown in Figure 1 and indicates the sequence or chromosomal locus of origin. For example, GGT represents the locus of origin of GGT1, the gamma-glutamyltransferase and gamma-glutamyltransferase light chain genes and their respective pseudogenes; FAM247 is the sequence/locus of origin of GGT5 [9,10]. POM121L9P and BCRP3 stem from the FAM247 sequence at chromosomal loci where the GGT sequence is found, as represented in Figure 1b. USP18 is the ubiquitin specific peptidase gene, a member of the deubiquitinating protease family; the protein product plays a major role in interferon regulation [11] and has multiple functions [12].

lincRNA gene families
The FAM230 lincRNA and FAM247 lincRNA gene families (https://www.genenames.org/) exemplify how segmental duplications or low copy repeats in chromosome 22 are a driving element in the genesis and proliferation of lincRNA gene families. Ten FAM230 and five FAM247 genes are present in chr22 low copy repeats (LCR22) that originate from sequence duplications [8]. FAM230 family genes differ from one another in sequence, transcript sequence and exon number, and in RNA expression in various developing fetal tissues [8,13,14]. Their functions are not known. FAM230 sequences are also present in primates, but these sequences are annotated as predicted protein genes or pseudogenes, not as lincRNA genes, e.g., more than eleven genes that contain the FAM230 sequence in chimpanzee and gorilla are annotated as protein genes, and two FAM230 sequences in Rhesus monkey and olive baboon are annotated as pseudogenes. An example is LOC106992440 in the Rhesus monkey; this sequence is present in the Rhesus genome but is not found in the genome of the prosimian primate ancestor, Philippine tarsier.
The FAM247 lincRNA gene family may have newly formed in humans, as there are few or no differences in gene sequence or in RNA transcript expression in somatic and developing fetal tissues [8,13]. A sequence homologous to FAM247 is present in chimpanzee and is linked to GGT2. It has the full-length FAM247 sequence [8], but the chimpanzee sequence has not been annotated as encoding a gene.
Segments of sequences homologous to FAM247 are found in other primate genomes (gorilla, orangutan, Rhesus monkey, Philippine tarsier), and these FAM247 sequences may date back evolutionarily to the house mouse and zebrafish (discussed below).

The FAM247 sequence is present in diverse genes
A significant property of the FAM247 sequence is that it forms part of diverse genes. Sequences homologous to FAM247 form genes that include lincRNA genes, pseudogenes, and protein genes (Figure 2). These genes stem from phylogenetically conserved nearest neighbor gene loci where the FAM247 sequence is linked to adjacent genes that form signatures containing gene families, e.g., FAM230E-FAM247C-GGT3P present in segmental duplication LCR22A and FAM230B-FAM247A-GGT2 in LCR22D.
Other than the FAM247 lincRNA family genes, which contain the entire 11,231 bp, only segments of
To analyze the phylogenetic relatedness of USP18 gene nt and aa sequences, sequences were aligned from zebrafish, the house mouse, three primate species and humans. The resultant percent sequence identities mimic evolutionary distances between species (Table 2).
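As an illustration of how a percent identity of the kind reported in Table 2 can be obtained from a pairwise alignment, the following minimal Python sketch counts matching columns in two already-aligned sequences. The function name, the toy sequences and the choice to exclude gapped columns from the denominator are assumptions made for this example; the alignment program and identity convention actually used for Table 2 are not specified here, and other conventions (e.g., dividing by the full alignment length) give different values.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator. This is
    only one of several conventions in common use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns with a residue in both sequences
    matches = 0   # columns where the two residues are identical
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Hypothetical toy alignment, not taken from USP18 or FAM247.
    print(percent_identity("ACGT-A", "ACCTTA"))
```

The same function applies to aligned amino acid sequences, since it compares characters column by column.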

USP18 exon 11
Both human exon 11, which encodes the last 14 aa (the carboxy terminal end) of the USP18 peptidase, and the 3' UTR of the USP18 mRNA sequence are provided by the FAM247 sequence [8]. The identity between the FAM247 nt sequence and the human/primate exon 11 nt sequences is 100%, with the exception of that of Philippine tarsier (Figure 4). The sequence of the carboxy terminal exon is more stable than the sequence of the entire gene (compare with Table 2). The identities of the USP18 3' UTR sequences from various species compared to FAM247 (Table 3) show that the 3' UTR sequence is also conserved in primates, but to a lesser extent than that of exon 11, and is more similar to the USP18 gene. The sequence similarity of 52% between FAM247 and zebrafish exon 11 (Figure 4B), the presence of a number of invariant nt positions (Figure 4C) and the similarity with the 3' UTR sequence (53%) (Table 3) suggest that this part of the FAM247 sequence was present in the USP18 sequence of zebrafish. The invariant nt residues of exon 11, e.g., positions nt 5-9 (Figure 4C), may relate to the functional importance of the USP18 carboxy terminal end in the regulation of the immune system by USP18 [11,12].
Figure 5 shows USP18 aa sequence percent identity, sequence alignment and a phylogenetic tree produced from an alignment of USP18 terminal exon aa sequences from different species with the translated aa sequence of FAM247. Eight of the 14 amino acid residues that form the terminal exon are totally conserved from primates to zebrafish, together with the FAM247 translated aa sequence (Figure 5c).
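Invariant positions such as those referred to for Figure 4C and Figure 5c can be read off a multiple alignment column by column. The sketch below is a minimal Python illustration under stated assumptions: the function name and the toy input are hypothetical, positions are reported 1-based, and a column containing a gap in any sequence is treated as not invariant. It is not the procedure used to produce the figures.

```python
def invariant_positions(aligned_seqs, gap="-"):
    """Return 1-based alignment columns in which every sequence has the same residue.

    Assumption: a column with a gap in any sequence is not counted as invariant.
    """
    if not aligned_seqs:
        return []

    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have equal length")

    positions = []
    for i in range(length):
        # Set of distinct characters found in this column across all sequences.
        column = {seq[i].upper() for seq in aligned_seqs}
        if len(column) == 1 and gap not in column:
            positions.append(i + 1)
    return positions


if __name__ == "__main__":
    # Hypothetical toy alignment, not the exon 11 sequences.
    print(invariant_positions(["ACGTA", "ACCTA", "AC-TA"]))
```

Dividing the number of invariant columns by the alignment length gives the fraction of conserved positions, e.g., eight of 14 residues for the terminal exon described above.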

The USP18 carboxy terminal peptide sequence interacts with the IFNAR2 interferon receptor and is an important regulator of IFN signaling [11]; in addition, the carboxyl end sequence functions in deISGylation [12,18,19]. A mutation in L365 in the exon 11 sequence 359QETAYLL365VYMKMEC372 abolishes deISGylation and IFNAR2 binding [19]; L365 is one of the evolutionarily conserved amino acids of exon 11 (Figure 5). The mutation may alter the protein conformation necessary for function. In addition, the high number of aa residues conserved relative to the FAM247 translated aa sequence adds to the suggestion that the FAM247 sequence was present in zebrafish USP18.
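The residue numbering in the exon 11 peptide can be checked with a short calculation. The Python sketch below uses only the peptide and the numbering given in the text (residues 359 to 372); the function and constant names are hypothetical and introduced for illustration.

```python
# Exon 11 peptide and numbering as given in the text: 359-QETAYLLVYMKMEC-372.
EXON11_PEPTIDE = "QETAYLLVYMKMEC"
FIRST_RESIDUE = 359


def residue_at(position, peptide=EXON11_PEPTIDE, first=FIRST_RESIDUE):
    """Return the one-letter amino acid code at a given protein position."""
    index = position - first  # convert protein numbering to a 0-based index
    if not 0 <= index < len(peptide):
        raise IndexError(f"position {position} lies outside the peptide")
    return peptide[index]


if __name__ == "__main__":
    last_residue = FIRST_RESIDUE + len(EXON11_PEPTIDE) - 1
    print(len(EXON11_PEPTIDE), last_residue, residue_at(365))
```

The peptide has 14 residues, its last residue is number 372, and position 365 is a leucine, consistent with the L365 mutation described above.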

GGT5
The human GGT5 protein gene resides in chromosomal segmental duplication LCR22G and is linked to the pseudogene POM121L9P, with a spacer sequence and the pseudogene GGTLC4P situated between GGT5 and POM121L9P (Figure 6) [8]. The GGT5 nearest neighbor gene/sequence arrangement is more complex than the signatures shown in Figure 1b. GGT5 is an anomaly, as its sequence does not stem from a GGT locus, as other GGT family members do, but from the chromosomal site containing the FAM247 sequence [8]. GGT5 and POM121L9P appear to have formed at very different evolutionary times. FAM247 is part of GGT5 genes in non-human primates, including Philippine tarsier. In addition, FAM247 provides the sequence for exon 1 of GGT5. There is a significant similarity between the FAM247 nt sequence and that of the mouse GGT5 exon 1 (Figure 7A). There is not enough evidence to suggest that the mouse GGT5 contains the entire 5' half of the FAM247 sequence, but alignment of the mouse exon 1 nt sequence with FAM247 shows that a significant number of nucleotides are invariant (Figure 7B). Although 50 out of 173 nt are invariant between the FAM247 sequence and zebrafish GGT5 exon 1, the zebrafish exon sequence shows significant differences, which makes it difficult to assess the sequence similarity further. The exon 1 data are consistent with formation of the GGT5 gene with the FAM247 sequence before the evolutionary appearance of primates, in mice or an ancestor of mice.
tissues [13,14]. Its functions are not known but should be of interest in view of the strong RNA expression levels. A homologous nearest neighbor gene arrangement is present in chimpanzee chr22, with genes that are annotated as the glutathione hydrolase light chain 2 gene (LOC749018) and the putative POM121-like protein 1 gene (LOC112206778); these are linked to GGT5 through the spacer sequence (Figure 9).
Thus, the sequences of the human pseudogenes GGTLC4P and POM121L9P are annotated as protein genes at the homologous chromosomal loci of chimpanzee. This is another example of ncRNA genes in humans that are annotated as protein genes in non-human primates and is of significance, but isolation of protein products from the chimpanzee genes is essential to confirm this.

Of importance is that 69% of the POM121L9P sequence is present in the chimpanzee genome with a 98% identity, at the genomic region where the comparable chromosomal locus resides in chimpanzee.
No FAM247 or POM121L9P sequences have been found linked to GGT5 in Rhesus. It appears that the development of the POM121L9P sequence began in the chimpanzee, with only a partial sequence.
Human BCRP3 and POM121L10P are linked to GGT1 in the gene/sequence arrangement GGT1-spacer-BCRP3-POM121L10P, which is present in chr22 LCR22H. FAM247 forms part of the two pseudogenes: BCRP3, which has FAM247 positions 33-5958, and POM121L10P, which has positions 5957-8219 (Figure 2). Thus, parts of the 5' and 3' regions of FAM247 are found in these linked genes, which is similar to the presence of FAM247 in the genes GGT5 and POM121L9P.
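The FAM247 coordinates quoted for BCRP3 and POM121L10P can be turned into segment lengths and fractions of the 11,231 bp sequence with simple arithmetic. The Python sketch below is illustrative only: the coordinates and total length are taken from the text, the helper names are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
FAM247_LENGTH = 11231  # bp, as stated for the full FAM247 sequence

# FAM247 positions reported for the two pseudogenes (assumed 1-based, inclusive).
SEGMENTS = {
    "BCRP3": (33, 5958),
    "POM121L10P": (5957, 8219),
}


def segment_length(start, end):
    """Length in bp of an inclusive 1-based interval."""
    return end - start + 1


def overlap_length(a, b):
    """Number of positions shared by two inclusive intervals (0 if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def fraction_of_fam247(start, end):
    """Fraction of the full FAM247 sequence covered by the interval."""
    return segment_length(start, end) / FAM247_LENGTH


if __name__ == "__main__":
    for gene, (start, end) in SEGMENTS.items():
        print(gene, segment_length(start, end), round(fraction_of_fam247(start, end), 3))
    print("shared positions:", overlap_length(SEGMENTS["BCRP3"], SEGMENTS["POM121L10P"]))
```

Under the inclusive-coordinate assumption, the two segments are nearly contiguous on FAM247, sharing only the two positions at their junction.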
BCRP3 is a member of the BCRP pseudogene family, which consists of eight pseudogenes, all of which contain a sequence homologous to the 3' end of the BCR protein gene. However, BCRP3 differs in that it contains additional sequence motifs (Figure 10) and is the only BCRP family member that contains the FAM247 sequence. The BCRP3 gene appears to have a unique sequence. The compositional make-up of BCRP3 shows that its 5' side has the FAM247 sequence, which is followed by a 4255 bp segment of the immunoglobulin lambda locus (IGL) and the 3' end of the BCR sequence (Figure 10). The IGL sequence is homologous to the IGL locus V segments and three C segments, which are known not to encode immunoglobulin proteins. The IGL sequence has an Alu sequence at the junction with FAM247, which may relate to the process of attachment of IGL to FAM247. In terms of RNA expression, the pseudogene shows broad expression of linear RNA in 27 normal somatic tissues and broad expression of circular RNA in developing fetal tissues [13,14].
The POM121L10P sequence is linked to BCRP3 on chr22. It also contains the FAM247 sequence (Figure 2). POM121L10P is compositionally made up of nearly the entire sequence of the related pseudogene POM121L1P, but has a 1062 bp sequence at its 3' end that consists of a copy of the 3' end of the BCR gene.
POM121L10P also appears to be a unique gene construct. The POM121L10P linear RNA transcript is strongly expressed in testes; circular RNAs are broadly expressed in fetal tissues [13,14]. Thus, both this gene and BCRP3 show robust RNA expression. It should be pointed out that there are additional POM121LP pseudogene family members that carry the FAM247 sequence but are not addressed here.
In the Rhesus genome, GGT1 is linked to the spacer sequence and followed by the FAM247 sequence, which is similar to the human GGT1 gene/sequence arrangement (Figure 1b). The Rhesus gene LOC107000612, annotated as a "breakpoint cluster region protein-like protein", is situated close to GGT1.

Conclusions
Both the FAM247 lincRNA gene family and the various pseudogenes appear to have the repeat FAM247 sequence as a foundation for gene development; however, the mechanism of formation and the compositional make-up differ greatly between the lincRNA genes and the pseudogenes. The these may contribute to the process of sequence addition. As these pseudogenes are unique, with large sequences unrelated to the parent protein gene, the question is whether they should be called pseudogenes. How the USP18 and GGT5 protein genes developed is not known, but a putative ancient FAM247 sequence was likely involved. A separate but important aspect of the FAM247 sequence in cellular and molecular functions is that it contributes the amino acid sequence for protein exons, namely the first exon of GGT5 and the last exon of USP18. The functions of the carboxyl terminal aa sequence of USP18 are of major significance because of its role in the regulation of the immune system.

Conflicts of Interest:
The author declares no conflict of interest.