Evolutionary Conserved Short Linear Motifs Provide Insights into the Cellular Response to Stress

Highlights
Hundreds of short linear motifs (SLiMs) that exhibit a high degree of sequence similarity to two biologically active sites of human alpha-fetoprotein (AFP) were identified.
The SLiMs of interest are ubiquitously distributed and found in proteins of both eukaryotic and prokaryotic species.
Proteins retrieved by sequence alignment belonged to various functional classes that are directly or indirectly involved in the cellular response to stress.
Our findings provide insights into the common functions of evolutionarily conserved SLiMs and the putative involvement of AFP in the response to external and internal stimuli in cellular adaptation during embryonic development and cancer.

Abstract
Short linear motifs (SLiMs) are evolutionarily conserved functional modules of proteins composed of 3 to 10 residues and involved in multiple cellular functions. Here, we performed a search for SLiMs that exhibit sequence similarity to two segments of alpha-fetoprotein (AFP), a major mammalian embryonic and cancer-associated protein. Biological activities of the peptides, LDSYQCT (AFP14–20) and EMTPVNPGV (GIP-9), have been previously confirmed under in vitro and in vivo conditions. In our study, we retrieved a vast array of proteins that contain SLiMs of interest from both prokaryotic and eukaryotic species, including viruses, bacteria, archaea, invertebrates, and vertebrates. Comprehensive Gene Ontology enrichment analysis showed that proteins from multiple functional classes, including enzymes, transcription factors, ribosomal proteins, and proteins involved in signaling, cell cycle, and quality control, were implicated in cellular adaptation to environmental stress conditions. These include response to oxidative and metabolic stress, hypoxia, DNA and RNA damage, protein degradation, as well as antimicrobial, antiviral, and immune response. Thus, our data provide insights into the common functions of SLiMs that are evolutionarily conserved across all taxonomic categories. These SLiMs can serve as important players in cellular adaptation to stress, which is crucial for cell functioning.


Introduction
Short linear motifs (SLiMs) are evolutionarily conserved functional modules of proteins that represent amino acid stretches composed of 3 to 10 residues involved in recognition and targeting activities [1]. SLiMs function through transient interactions with a variety of binding partners, mostly with globular domains of other proteins. Thereby, they are involved in protein-protein interactions, which underlie numerous cellular processes [...] including bacteria, viruses, archaea, and eukaryotes that contain both SLiM types of interest. Amino acid composition analyses of all retrieved SLiMs revealed a high degree of sequence conservation and hotspot residues. Furthermore, we performed comprehensive Gene Ontology (GO) functional enrichment analysis and found that both motif types can be identified in proteins involved directly or indirectly in the cellular response to biotic and abiotic stress. Our data suggest that these conserved motifs underlie the involvement of a vast array of proteins in the cellular response to stress conditions. AFP may also be involved in cellular adaptation to oxidative, genotoxic, and metabolic stress during embryonic development and cancer growth.

Search for Short Linear Motifs
We carried out local sequence alignment using both AFP-derived peptides, LDSYQCT and EMTPVNPGV, as queries for the sequence similarity search. The FastA suite [26] provided by the European Bioinformatics Institute of the European Molecular Biology Laboratory (EMBL-EBI) (https://www.ebi.ac.uk/Tools/sss/fasta/ (accessed on 8 January 2022)) [27] was used. The alignment was performed against the UniProtKB protein knowledgebase (https://www.uniprot.org/ (accessed on 8 January 2022)), including both the UniProtKB/Swiss-Prot (manually annotated and reviewed) and UniProtKB/TrEMBL (automatically annotated) sections [28]. No taxonomic restriction was applied. The GLSEARCH algorithm (version 36.3.8h) provided the optimal search for sequences that match the query peptides. Default parameters were used: BLOSUM50 matrix, gap open penalty −10, gap extension penalty −2, and expectation value (E-value) upper limit 10 and lower limit 0, with up to 500 alignments retrieved.
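To make the idea of matching a short query peptide against a longer protein sequence concrete, the following is a minimal sketch and not the GLSEARCH algorithm itself. It slides the query along a target without gaps and reports the window with the highest percent identity; it ignores the BLOSUM50 scoring, gap penalties, and E-value statistics used in the actual search. The function name and the toy target sequence are hypothetical.

```python
from typing import Tuple


def best_ungapped_match(query: str, target: str) -> Tuple[int, str, float]:
    """Return (start index, window, percent identity) of the best ungapped window.

    Simplified illustration only: no substitution matrix, no gaps, no E-values.
    """
    q = len(query)
    if len(target) < q:
        raise ValueError("target is shorter than the query")
    best = (0, target[:q], -1.0)
    for start in range(len(target) - q + 1):
        window = target[start:start + q]
        matches = sum(a == b for a, b in zip(query, window))
        identity = 100.0 * matches / q
        if identity > best[2]:  # keep the first window with the highest identity
            best = (start, window, identity)
    return best


# Hypothetical toy target, not a real UniProtKB entry.
start, window, identity = best_ungapped_match("LDSYQCT", "MKALDTYQCSGG")
print(start, window, round(identity, 1))
```

A real similarity search additionally scores conservative substitutions and estimates statistical significance, which is why a dedicated tool was used in the study.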

Amino Acid Conservation Analysis
SLiMs obtained with the FastA GLSEARCH algorithm were further subjected to amino acid substitution analysis. The frequency of each residue at each position of the SLiMs was calculated as follows: N = a/b × 100%. Here, a is the number of occurrences of a given residue at a given position and b is the total number of SLiMs. All SLiMs were taken into account, including those aligned to AFP itself from all species and those from uncharacterized and hypothetical proteins. The amino acid conservation was represented graphically with the WebLogo3 tool (http://weblogo.threeplusone.com/create.cgi (accessed on 5 February 2022)) [29].
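The per-position calculation N = a/b × 100% can be sketched as follows. This is an illustrative implementation under the assumption that all motifs have equal length; the function name and the four toy motifs are hypothetical and are not taken from the retrieved dataset.

```python
from collections import Counter
from typing import Dict, List


def position_frequencies(motifs: List[str]) -> List[Dict[str, float]]:
    """For each position, return {residue: N}, where N = a / b * 100.

    a = occurrences of the residue at that position, b = total number of motifs.
    """
    if not motifs:
        raise ValueError("at least one motif is required")
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("all motifs must have the same length")
    b = len(motifs)
    freqs = []
    for pos in range(length):
        counts = Counter(m[pos] for m in motifs)
        freqs.append({res: 100.0 * a / b for res, a in counts.items()})
    return freqs


# Hypothetical toy motifs for illustration only.
toy_motifs = ["LDSYQCT", "LDSFQCS", "MDTYKCT", "LNSYQCT"]
freqs = position_frequencies(toy_motifs)
print(freqs[5])  # position 6 of the toy set is fully conserved
```

In this toy set, position 6 contains only cysteine, so its frequency is 100%, while position 1 is split between L (75%) and M (25%).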

Functional Classification of Retrieved Proteins
All proteins extracted from both the Swiss-Prot and TrEMBL sections of the UniProtKB database were subjected to GO term-based functional classification [30] in both the molecular function and biological process categories (http://geneontology.org/ (accessed on 14 May 2022)). These included all retrieved proteins from both prokaryotic and eukaryotic taxonomies. Since TrEMBL is a large section that contains automatically annotated proteins, cut-offs of E-value 0.1 and identity of 57.1% for AFP14-20-like motifs and of E-value 0.1 and identity of 55.6% for GIP-9-like motifs were applied to alignments against this section of UniProtKB. In addition to UniProtKB, the InterPro protein family resource (https://www.ebi.ac.uk/interpro/ (accessed on 7 June 2022)) was used for functional annotation of the retrieved proteins [31].
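The cut-offs above can be read as simple filters on each alignment. The sketch below assumes that the identity thresholds correspond to 4 of 7 identical residues (57.1%) for the heptapeptide and 5 of 9 (55.6%) for the nonapeptide, and that both cut-offs are inclusive; these readings are our assumptions rather than statements from the text. The `Hit` class, its fields, and the accession strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str      # hypothetical identifier, not a real UniProtKB accession
    e_value: float
    matches: int        # identical positions in the alignment
    query_length: int   # 7 for LDSYQCT, 9 for EMTPVNPGV

    @property
    def identity(self) -> float:
        """Percent identity relative to the query length."""
        return 100.0 * self.matches / self.query_length


def passes_cutoff(hit: Hit, max_e_value: float = 0.1, min_identity: float = 57.1) -> bool:
    # Identity is rounded to one decimal so that 4/7 -> 57.1 and 5/9 -> 55.6,
    # matching the precision of the reported thresholds (assumed inclusive).
    return hit.e_value <= max_e_value and round(hit.identity, 1) >= min_identity


afp_hits = [Hit("X1", 0.05, 4, 7), Hit("X2", 0.05, 3, 7), Hit("X3", 0.5, 7, 7)]
kept = [h.accession for h in afp_hits if passes_cutoff(h)]
gip_ok = passes_cutoff(Hit("Y1", 0.1, 5, 9), min_identity=55.6)
print(kept, gip_ok)
```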

Gene Set Enrichment Analysis
For further gene set enrichment analysis, two lists of genes coding for the retrieved proteins containing either AFP14-20-like or GIP-9-like motifs were manually created. UniProtKB IDs were used and, when needed, converted into Ensembl gene IDs and STRING-db protein IDs. These datasets were used as backgrounds for GO enrichment analysis. The created lists were first uploaded into the PANTHER classification system (http://pantherdb.org/ (accessed on 17 May 2022)) of the Gene Ontology resource [32]. The R/Bioconductor packages in the graphical ShinyGO v0.75 suite (http://bioinformatics.sdstate.edu/go/ (accessed on 24 May 2022)) were utilized [33] for further functional enrichment analysis. Characteristics of each gene list were compared with those of the other genes of the whole genome (background), and Student's t-test was applied. Additionally, the gProfiler functional enrichment analysis resource [34] (https://biit.cs.ut.ee/gprofiler/ (accessed on 28 May 2022)) was utilized. Here, a gSCS statistical threshold of 0.2 and ENTREZGENE_ACC numerical IDs were used to extract all known gene sets.

Figure 1 depicts the overall U-shaped architecture and 3D organization of human AFP, with secondary structure elements represented by alpha-helices and loops and no beta-strands. Visualization of the obtained structure showed that the two distinct functionally important segments of human AFP with experimentally confirmed biological activities, AFP14-20 and GIP-9, are located on the protein surface and are therefore accessible to the solvent and/or protein binding.
Figure 1. The overall architecture of AFP is represented by a U-shaped structure composed of three domains: I (orange, residues 19-210), II (green, residues 211-402), and III (cyan, residues 403-601). Two functionally important segments are shown: AFP14-20 with sequence LDSYQCT (residues 32-38, colored in blue) and GIP-9 with sequence EMTPVNPGV (residues 489-497, colored in red), which is a part of GIP-34 (residues 464-497, colored in pink).
The first peptide segment is located in domain I, close to the N-terminus, and is arranged in an α-helical conformation. The second segment encompasses the C-terminal part of the GIP-34 peptide, which occupies the longest α-helical stretch in domain III. Only a small part of the GIP-9 peptide is arranged in an α-helix, while the remaining part represents a disordered region, which may play a role in its functionality.

Proteins Containing SLiMs of Interest Are Biologically Diverse
Local sequence alignment enabled retrieval of 464 proteins from the Swiss-Prot section of the UniProtKB database and 500 proteins from its TrEMBL section (with maximum E-value 6.9 × 10⁻⁴) that contain SLiMs with sequence similarity to the LDSYQCT peptide. They covered proteins from a wide range of taxonomic categories and included uncharacterized and hypothetical proteins. Table 1 shows the most representative proteins from various species aligned with the LDSYQCT sequence, together with the alignment E-values: the lower the E-value, the higher the statistical significance of the alignment. In the alignment column, the upper sequence is the query, whereas the lower sequence is from the retrieved protein. Proteins that contain AFP14-20-like motifs play various biological roles, including transcriptional and translational regulation, oxidoreductase and electron transfer activity, protein quality control, host-pathogen interaction, biotic and abiotic stress response, and functioning as components of ribosomes and the toxin-antitoxin system.
SLiMs with sequence similarity to the EMTPVNPG octapeptide were identified in 258 proteins from the Swiss-Prot section and 500 proteins from the TrEMBL section (with maximum E-value 4.3 × 10⁻²) of the UniProtKB database. These proteins covered all taxonomic categories and included AFP from different biological species as well as uncharacterized and hypothetical proteins. Table 2 contains the most representative proteins aligned to the GIP-9 segment and the alignment E-values. Proteins that contain GIP-9-like motifs also have a wide range of biological roles, including involvement in cell signaling, transcriptional regulation, metabolic processes, response to chemicals, immune response, and electron transfer.

SLiMs of Interest Are Enriched in Conserved Residues
After the exclusion of the same proteins from different taxonomies, 199 AFP14-20-like and 280 GIP-9-like unique motifs were identified. Furthermore, we assessed amino acid frequencies at each position of the unique SLiMs. The most conserved residue in AFP14-20-like motifs was cysteine (C), which comprised 100% of residues at position 6. The second-most conserved residue was aspartic acid (D), which constituted 82.9% of all residues at position 2 and can be replaced, predominantly, by the physicochemically similar asparagine (N) and glutamate (E). Two aromatic amino acids, tyrosine (Y) and phenylalanine (F), comprised 83.9% of all residues at position 4. At position 1, 57.3% of residues were leucine (L), which can be substituted by other hydrophobic residues: methionine (M), isoleucine (I), and valine (V). Serine (S) constituted 54.8% of all residues at position 3 and was replaced mostly by hydrophilic amino acids: T, K, and E. Hydroxyl group-containing residues, T and S, constituted 65.3% of all residues at position 7. Glutamine (Q) comprised 67.8% of all residues at position 5 and was replaced by charged and hydrophilic amino acids: lysine (K), aspartate (D), and arginine (R). These calculations, with a threshold of 5% applied, yielded the notation for the consensus sequence. Figure 2A graphically depicts the frequency of each amino acid at every position of the retrieved AFP14-20-like motifs.
As for GIP-9-like motifs, the three most conserved positions were 4, 7, and 8. Positions 4 and 7 were occupied by proline (P), which comprised 96% and 98% of all residues, respectively. The third-most conserved residue was glycine (G), which comprised 92% of all residues at position 8. The least conserved position was 2, where methionine (35.4%) was the most frequent residue and could be replaced by other hydrophobic amino acids: L, I, and V. At position 1, glutamic acid constituted 60.4% of all residues and was replaced most frequently by D, Q, and K, which have similar physicochemical properties. Threonine (T) constituted 51.8% of all residues at position 3 and was replaced most frequently by serine (S), a physicochemically similar residue (12.9%). Position 5 was occupied mostly by large hydrophobic residues: V (55.0%), I (20.4%), and L (7.9%). At position 6, asparagine (N) comprised 56.8% of all residues, and the most significant replacements were D and S, while position 9 was occupied by large hydrophobic amino acids: V (47.1%), L (12.1%), and I (18.2%). On the basis of these calculations, the notation for the consensus sequence was derived. Figure 2B graphically depicts the frequency of each amino acid at every position of the identified GIP-9-like motifs.
Therefore, both SLiM types of interest contain a large proportion of conserved amino acid residues indicating that they are evolutionarily preserved through all biological species starting from bacteria and viruses to higher eukaryotes. Interestingly, the consensus sequences were enriched in D, S, and P residues found in prototype peptides, which have been proposed to give rise to modern proteins.
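The thresholding step described in this section, in which residues below a 5% frequency are dropped when writing a consensus, can be sketched as follows. The bracketed, PROSITE-like output format is an assumption chosen for illustration and is not necessarily the notation used in the original analysis. Whether the threshold is inclusive is also assumed, and the toy frequencies are hypothetical.

```python
from typing import Dict, List


def consensus_pattern(freqs: List[Dict[str, float]], threshold: float = 5.0) -> str:
    """Build a bracketed consensus from per-position residue frequencies (in %).

    Residues at or above the threshold are kept, ordered by decreasing frequency.
    """
    parts = []
    for pos_freqs in freqs:
        kept = sorted(
            (res for res, f in pos_freqs.items() if f >= threshold),
            key=lambda res: (-pos_freqs[res], res),
        )
        if not kept:
            parts.append("x")            # no residue reaches the threshold
        elif len(kept) == 1:
            parts.append(kept[0])        # single dominant residue
        else:
            parts.append("[" + "".join(kept) + "]")
    return "-".join(parts)


# Hypothetical toy frequencies for three positions (not the study's data).
toy_freqs = [
    {"L": 60.0, "M": 25.0, "I": 11.0, "A": 4.0},
    {"D": 90.0, "N": 10.0},
    {"C": 100.0},
]
pattern = consensus_pattern(toy_freqs)
print(pattern)
```

Here alanine at the first position falls below the 5% threshold and is omitted, so the toy pattern reads `[LMI]-[DN]-C`.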

Retrieved Genes Ubiquitously Exist
Furthermore, we classified the unique genes that code for proteins containing both SLiM types of interest according to their taxonomic category. We found that the retrieved genes are widely distributed among all taxonomic groups, including bacteria, viruses, archaea, and various invertebrate and vertebrate species, including mammals and primates. Figure 3A,B depicts the taxonomic distribution of unique genes coding for proteins with AFP14-20-like and GIP-9-like motifs, respectively.
Up to 64% and 74% of AFP14-20-like and GIP-9-like motifs, respectively, were found in bacterial proteins, while about 10% and 16% of motifs, respectively, were identified in mammalian proteins. Some retrieved genes had orthologs in multiple biological species; therefore, each such gene was treated as a unique gene. For example, both SLiMs of interest were found in cytochrome c biogenesis protein CcmE and malate dehydrogenase from a wide range of bacterial species, while transmembrane protein TMEM258 was from various eukaryotic species (see Tables 1 and 2).

Figure 3. Taxonomic distribution of unique genes coding for proteins with (A) AFP14-20-like and (B) GIP-9-like motifs. Color codes: prokaryotes: bacteria (blue), viruses (brown), archaea (grey); mammals: Homo sapiens (blue), primates (brown), other mammals (grey); vertebrates: birds (blue), fishes (brown), amphibia (grey); invertebrates: reptiles (blue), insects (brown), nematodes (grey); plants: higher plants (blue), algae (brown); other eukaryotes: S. cerevisiae (blue), fungi (brown), mollusks, scorpions, spiders, etc. (grey).

Retrieved Proteins Are Functionally Diverse
We used GO term annotations provided in the UniProtKB and InterPro databases to classify all retrieved proteins according to molecular function and biological process categories (Figure 4). The total number of terms can differ from the total number of proteins aligned to each SLiM type of interest because (i) more than one GO term may be assigned to a unique protein and (ii) the same unique protein can belong to a variety of taxonomic categories. As shown in Figure 4A, metal ion binding, catalytic activity, and transferase activity were the predominant molecular function terms for AFP14-20-like motif-containing proteins. Additionally, there were proteins with oxidoreductase/electron transfer, DNA/RNA-binding, transcription factor, antimicrobial defense, and immune response activities. The largest portion of proteins aligned to the GIP-9 segment of human AFP belonged to oxidoreductases and to metal ion/iron-sulfur cluster binding, heme binding, and DNA binding proteins (Figure 4B).
Categorization of the retrieved proteins according to biological process terms showed that the majority of AFP14-20-like motif-containing proteins are involved in transcriptional regulation, oxidative stress response, RNA processing, and host-pathogen defense response (Figure 4C). GIP-9-like motif-containing proteins were involved in aerobic respiration/electron transfer, response to environmental stress, metabolic processes, regulation of gene expression, translation, DNA repair, and protein quality control (Figure 4D).
Additionally, prominent roles belonged to proteins involved in the response to pathogens and the immune response. There were apparent relationships between molecular function and biological process terms. For example, DNA binding and metal ion binding activities can be assigned to transcriptional regulation, while electron transfer/oxidoreductase activities and, partly, metal ion binding activity underlie the cell response to oxidative stress and antimicrobial, antifungal, and antiviral defense responses.

Prokaryotic Genes Are Required for Stress Tolerance
In order to identify the most statistically significant GO categories, we carried out gene set enrichment analysis with the ShinyGO v0.75 and gProfiler suites. In the GO classification system, 389 unique genes encoding AFP14-20-like motif-containing proteins and 273 unique genes encoding GIP-9-like motif-containing proteins were mapped to Ensembl gene IDs. Figure 5 depicts a typical GO term-based functional categorization of genes encoding AFP14-20-like motif-containing proteins. From our gene set list, up to 41 bacterial genes were mapped to Ensembl genome IDs.
Antioxidants 2023, 12, x FOR PEER REVIEW 12 of 26
As shown in Figure 5A, at an FDR cutoff of 0.2, bacterial genes associated with nucleotide/nucleic acid binding, ion/metal ion binding, and ATP binding activities were retrieved at high statistical significance (low p-value) in the molecular function categories. Not surprisingly, biological processes involved in metabolism and in the nucleotide/nucleic acid and amino acid biosynthesis required for bacterial reproduction were overrepresented (Figure 5B). However, when all available gene sets were retrieved, oxidoreductase and chaperone activity as well as chemical stimuli/stress response and SOS response activities were among the statistically significant categories identified for bacterial proteins (Figure 5C).
These data were confirmed by functional enrichment analysis of genes encoding GIP-9-like motif-containing proteins. From our gene list, up to 29 unique genes were mapped to Ensembl genome IDs in each bacterial taxonomy. Figure 6A-C depicts the all-available gene set enrichment analyses for three representative bacterial species.
As shown in Figure 6, pathways associated with metabolic processes, nucleic acid and protein biosynthesis, translation, and DNA repair are the most statistically significant. However, pathways that underlie cellular response to biotic and abiotic stress and chemical stimuli were identified. They included SOS response and oxidative stress response that occurs with the involvement of oxidoreductase/electron transfer enzymes including those containing Fe-S clusters.
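To illustrate how an FDR cutoff such as the 0.2 used above is applied to a list of enrichment p-values, the following is a minimal sketch of the Benjamini-Hochberg adjustment. The text does not state which FDR procedure the enrichment tools apply, so the use of Benjamini-Hochberg here is an assumption, and the p-values are hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four GO terms.
toy_p = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(toy_p)
significant = [q <= 0.2 for q in adj]  # FDR cutoff of 0.2, as in the text
print(adj, significant)
```

With these toy values, the first three terms pass the 0.2 cutoff and the fourth does not.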
In humans, up to 54 unique genes encoding AFP14-20-like motif-containing proteins were mapped to Ensembl gene IDs. In other mammals, the number of corresponding unique genes ranged from 34 to 53 and from 14 to 20, respectively.

Eukaryotic Genes Are Responsible for Stress and Defense Response
For H. sapiens, in GO molecular function terms, protein/receptor binding, ion/metal ion binding and calcium binding, DNA and heterocyclic compound (nucleotide) binding, as well as dioxygenase and oxidoreductase activities were among the most significant categories (Figure 8A). As expected, in GO biological process terms, biosynthetic and developmental processes as well as cell communication and cell signaling pathways were identified as the most significant functional terms (Figure 8B).
Unexpectedly, response to stress and chemical stimulus and DNA damage were also among the statistically significant biological processes. This picture was typical for various animal species, where a wide range of stress response proteins including oxidoreductases, ubiquitin activating enzymes, channel activity regulators, and cell signaling proteins were retrieved. In plants, up to 47 unique genes encoding AFP14-20-like motif-containing proteins were mapped to Ensembl gene IDs. These included proteins important for cell division, such as those involved in RNA binding, nucleotide biosynthesis, and translation. Interestingly, proteins implicated in stress/defense response, such as oxidoreductases and proteins involved in killing of other organisms, were also among the significant ones in plants (Figure 8C,D).
As for GIP-9-like motif-containing proteins, fewer statistically significant GO terms were identified (Figure 9). Up to 21 unique genes in mammals and up to 15 unique genes in plants were mapped to Ensembl gene IDs.
In H. sapiens, protein and nucleotide binding activities along with chaperone and ion/metal ion binding activities were among the overrepresented GO molecular function terms (Figure 9A). In biological process terms, immune and defense response as well as autophagy and apoptosis (Figure 9B) were identified among significant human genes. In plants, NADPH-dependent oxidoreductase, ion channel, and RNA/DNA binding activities, which underlie response to external stimulus, protein localization, and cellular metabolism, were identified (Figure 9C,D).
A. thaliana genes categorized in (C) molecular function terms and (D) biological process terms. Color codes indicate data inferred from: dark brown, experiment/direct assay; light brown, genetic and physical interactions; yellow, sequence similarity; dark purple, high-throughput experiment; green, curator; blue, reviewed computational data.

Discussion
SLiMs are often found in the rapidly evolving intrinsically disordered regions of proteins, and motif acquisition can proceed through convergent evolution [92]. Frequent mutations, small size, and low complexity make it difficult to identify motifs and to study their functions. Here, we used bioinformatics and GO enrichment analyses to search for SLiMs with sequence similarity to two AFP-derived sequences, LDSYQCT (AFP14-20) and EMTPVNPGV (GIP-9). We identified a vast array of similar motifs across all taxonomic categories including bacteria, viruses, archaea, and various eukaryotic species.
One of the most prominent molecular functions of human and rodent AFPs is metal ion binding capability [93], which is similar to the activities of the majority of the proteins retrieved in our study. This capability underlies the involvement of proteins in various cellular processes including metabolism, transcriptional regulation, and redox regulation. Most prokaryotic proteins were, unsurprisingly, involved in the nucleotide, nucleic acid, amino acid, and protein biosynthesis necessary for reproduction. However, the overwhelming majority of both prokaryotic and eukaryotic proteins, including enzymes, transcription factors, quality control proteins, and ribosomal proteins, were involved in cellular adaptation to environmental changes and various stress conditions. Our data suggest that AFP can use the SLiMs of interest to provide cellular adaptation to stress conditions during embryonic development and cancer growth.


AFP 14-20-like Motif-Containing Proteins
We found that most bacterial and archaeal proteins containing short segments aligned with AFP 14-20 at high statistical significance (E-values of ~10⁻⁵ to 10⁻⁴) are involved in maintaining cellular redox balance (Table 1). For example, iron-sulfur (Fe-S) cluster proteins such as rubredoxins, ferredoxins, anaredoxin, and desulfoferrodoxin exert antioxidant activity and play important roles in bacterial adaptation to environmental changes [44,46]. These proteins share a distinctive structural feature: four Cys residues surround the Fe-S clusters involved in electron transfer from cognate reductases to cytochrome P-450s, which helps maintain pathogen viability [35]. Fe-S clusters are found in many enzymes central to metabolic processes such as nitrogen fixation, respiration, and DNA processing and repair. Additionally, enzymes with flavin oxidoreductase activity were retrieved, such as choline dehydrogenase (Cdh), which oxidizes choline to betaine aldehyde for further oxidation to betaine. Betaine is a source of CH3 groups for the biosynthesis of nucleotides, amino acids, and other compounds, and supports adaptation of phototrophic bacteria to osmotic stress [53]. Choline oxidation is associated with electron transfer to the electron transport chain (ETC) and ROS generation [38]. Consistent with this, NADH-quinone oxidoreductase (ETC complex I), one of the major sites of ROS production in many bacterial strains [65], was also aligned to the AFP 14-20 segment. Additionally, a variety of modulators of the environmental stress response were aligned to the AFP 14-20 segment. Among them were histidine kinase response regulator protein, stress response protein YhaX [43,59], Sel1 domain-containing protein [42], and RagB/SusD family nutrient uptake outer membrane protein [39], which regulate the host cell response to pathogens.
Moreover, the bacterial enzymes 8-oxo-dGTP diphosphatase MutT and dITP/XTP pyrophosphatase, which are involved in the SOS response through the removal of oxidatively damaged and non-canonical nucleotides, were retrieved [41].
Transcription factors that regulate gene expression in bacteria, archaea, and viruses for their adaptation to environmental stress conditions were also among the retrieved proteins. They included a helix-turn-helix domain-containing AraC family regulator and a TetR transcriptional regulator, which typically bind to target DNA and regulate pathogenic properties by sensing small-molecule inducers such as urea, bicarbonate, and glycerol [49,52]. Bacterial ribosomal enzymes that catalyze posttranslational modification of proteins involved in translation were also aligned to the AFP 14-20 segment. An example is rimI, which encodes the ribosomal protein S18-alanine acetyltransferase [36]. Among viral proteins, we found proteins involved in host-pathogen interaction through the promotion of nucleic acid replication and through the host adaptive immune response. They included host range factor 1 [48] and infected cell protein 47 (ICP47) [60], which function under changing redox conditions. For example, ICP47 directly binds the transporter associated with antigen processing (TAP), leading to the occurrence of empty MHC-I, which is under redox control through disulfide bond oxidation/reduction [94].
In plants, the Rho family of Ras-related GTP-binding (Rop) proteins work as signaling switches that control growth, development, and apoptosis in response to various environmental stimuli [54]. Proteins containing the highly conserved catalytic PRONE (plant-specific Rop nucleotide exchanger) domain, which has strong substrate specificity for members of the Rop family, were aligned to the AFP 14-20 segment. Additionally, developmental proteins with antimicrobial activity such as gibberellic acid-stimulated Arabidopsis (GASA) [66] were retrieved. Also retrieved, though at low significance, was the acidic leucine-rich nuclear phosphoprotein 32-related protein 2, which is involved in histone chaperone activity and the integration of the environmental stress response in plants, and in immunomodulation and tumor progression in humans [95].
In animals, a variety of small proteins with ion channel regulator and toxin activity, such as U-scoloptoxin [55], auger peptide hheTx2 [54], leiurutoxin-3 [62], and others produced by various mollusks, snakes, and insects, were aligned with the AFP 14-20 segment. Additionally, Cys-rich and metal ion binding small proteins including defensins, ranatuerins, and brevinins figure prominently in the alignment. These host defense proteins have key roles in oxidative stress response, immune response, and antimicrobial, antifungal, and antiviral activities [61]. Defensins have been implicated in tumor growth, exhibiting both tumor-suppressive and tumor-promoting effects [56]. In human carcinomas, defensins exert antitumor effects through induction of apoptosis, inhibition of angiogenesis, and immunomodulation.
Furthermore, transcription regulators involved in the response to changes in microenvironmental conditions were retrieved. They include CCHC-type domain-containing protein, C2H2-type zinc finger protein 142, nucleus accumbens-associated protein 1 (NAC1), and retinoic acid receptor RXR-gamma-B, which are involved in various diseases including cancer and neurodevelopmental disorders [37,40,60,63]. Additionally, the retrieval of a calcium-binding EGF-like domain protein is notable because the AFP 14-20 motif is part of EGF and EGF-like domains involved in various signaling pathways [51]. Among them are JAG1/Notch signaling cascades, which activate a number of oncogenic factors that regulate cell proliferation, metastasis, angiogenesis, and drug resistance [47]. Furthermore, denticleless protein homolog (DTL) has been associated with the response to DNA damage and the immunosuppressive tumor microenvironment [45].

GIP-9-like Motif-Containing Proteins
Thermonuclease family ribonuclease HII and PINc domain-containing proteins, which are involved in DNA and RNA degradation under stress conditions as a bacterial defense mechanism [75,89], were among the retrieved proteins. Furthermore, components of bacterial toxin-antitoxin systems such as the addiction module HigA family antidote, which promote adaptation and persistence by modulating bacterial growth in response to stress [86], were also retrieved.
Qualitatively, most bacterial proteins, including the cupredoxin domain-containing protein, play pivotal roles in many metabolic pathways and in the regulation of redox homeostasis, which are crucial for pathogen survival [82]. NADPH-dependent oxidoreductases such as malate dehydrogenase and short-chain dehydrogenase (SDR) family oxidoreductase are among the enzymes that undergo a thiol-based redox switch as part of the adaptive response to oxidative stress conditions [78]. These also include a cytochrome c biogenesis protein that provides heme binding to the apoprotein of cytochrome c and cytochrome c-552, components of electron transfer and mitochondrial redox regulation [50,73]. Other proteins containing GIP-9-like segments and involved in the pathogen response to oxidative stress included protein kinases and proteases.
Many proteins, including those responsible for cell cycle control and embryonic development, are regulated under oxidative stress conditions. For example, the Cys residue of CoA-binding protein can undergo S-thiolation in response to oxidative and metabolic stress [71]. Some homologs of bacterial stress response proteins are implicated in disease pathogenesis in humans. For example, the divalent cation tolerance protein CutA homolog has been proposed to mediate acetylcholinesterase activity and copper homeostasis, which are implicated in Alzheimer's disease [87]. In proteobacteria, cell division proteins display redox transformation due to electron transfer and reduction of oxygen, nitrogen, and hydrogen sulfide [72]. Among viral proteins, the envelope glycoprotein E, which is involved in the host immune response to pathogens, and a viral protein kinase that has a role in virus virulence and tumor pathogenesis [77] were retrieved.
Similar to AFP 14-20-like motifs, GIP-9-like motifs were identified in small Fe-S cluster-containing proteins such as ferredoxins and Cisd2-a protein [80]. Furthermore, coevolution of bacteria with their hosts enabled them to tolerate oxidative stress conditions with the use of an antioxidant system (AOS) that includes both enzymatic and non-enzymatic components [96]. In this context, cupin domain-containing proteins contribute to counteracting the host defense through a functional diversity that includes an AOS component, the superoxide dismutase (SOD) enzyme [74]. Additionally, glutathione S-transferases play important roles in the environmental stress response through S-glutathionylation of Cys residues and thiol groups, resulting in detoxification of the target molecule [91].
In eukaryotes, GIP-9-like motifs were found in proteins such as small proline-rich proteins, which regulate the cell cycle, cell proliferation, and differentiation [85]. Interestingly, these proteins can be involved in tumor progression, and their functioning is under redox control [97]. Additionally, ceruloplasmin, a major copper-carrying plasma protein that possesses ferroxidase activity and is involved in redox regulation [84], was retrieved. In plants and algae, photosystem II stability and assembly factor HCF136, which is essential for the formation of the photosystem II complex, and a plastocyanin-like domain-containing protein involved in electron transfer during photosynthesis [81] were retrieved. They are regulated by redox switches between active and inactive states during the light-dark transition [98]. Additionally, various stress-related proteins involved in the quality control machinery, including a C2H2-type zinc finger-containing protein and zinc metalloproteinases [85], were identified among GIP-9-like motif-containing proteins. Moreover, a variety of transmembrane proteins involved in the ER stress response, such as TMEM258 [99], and antimicrobial peptides were retrieved, though at lower statistical significance.
In mammals, transcription factors such as forkhead box protein O1 (FOXO1) and homeobox protein Hox-C5 (HOXC5), which play important roles in metabolism, cell proliferation, apoptosis, development, and stress resistance [90], were identified. Additionally, HSP family members, along with tumor necrosis factor (TNF) ligand family cytokines and the Wnt-1 protein involved in the Wnt/β-catenin signaling pathway, which are key players in redox regulation and cancer development [100], were among the retrieved proteins, though at lower significance.

Conclusions
In our study, we undertook a comprehensive functional enrichment analysis of a wide range of proteins from all taxonomic groups and different functional classes. All these proteins share a structural characteristic, the presence of conserved SLiMs. Both types of short sequences used as queries for the sequence similarity search were derived from AFP, a major mammalian embryo-specific and tumor-associated protein. Therefore, the identification of a variety of transcription factors and proteins involved in cell signaling, cell cycle progression, cell proliferation and differentiation, and protein quality control was anticipated. However, unexpectedly, various prokaryotic and eukaryotic proteins responsible for the cellular response to both biotic and abiotic stress were retrieved as containing both AFP 14-20-like and GIP-9-like motifs. They included proteins implicated in adaptation to and protection against pathogens, reactive oxygen species, toxins, and various chemical agents. Moreover, the overwhelming majority of retrieved transcription factors and proteins involved in replication and translation were reported to participate in cellular and organismal adaptation to environmental stress stimuli.
We hypothesize that both AFP-derived peptides could have arisen from prototype peptides over a long evolutionary time. At the early stages of biochemical evolution, these peptides were involved in the cellular stress response and preserved this function in modern proteins, including AFP. Therefore, bioinformatics and GO functional enrichment analyses of SLiMs allow insight into the common functions of a variety of proteins and into the involvement of AFP in the cellular response to external and internal stimuli during embryonic development and cancer growth. Nevertheless, our data require further confirmation by experimental approaches.

Conflicts of Interest:
The authors declare no financial, professional, or personal competing interests.