In Silico Estimation of the Abundance and Phylogenetic Signiﬁcance of the Composite Oct4-Sox2 Binding Motifs within a Wide Range of Species

: High-throughput sequencing technologies have greatly accelerated the progress of genomics, transcriptomics, and metagenomics. Currently, a large amount of genomic data from various organisms is being generated, the volume of which is increasing every year. Therefore, the development of methods that allow the rapid search and analysis of DNA sequences is urgent. Here, we present a novel motif-based high-throughput sequence scoring method that generates genome information. We found and identiﬁed Utf1-like, Fgf4-like, and Hoxb1-like motifs, which are cis-regulatory elements for the pluripotency transcription factors Sox2 and Oct4 within the genomes of di ﬀ erent eukaryotic organisms. The genome-wide analysis of these motifs was performed to understand the impact of their diversiﬁcation on mammalian genome evolution. Utf1-like, Fgf4-like, and Hoxb1-like motif diversity was evaluated across genomes from multiple species.


Introduction
The processes of cell reprogramming to a pluripotent state at the molecular level starts with protein-protein convergence caused by binding to neighboring DNA sites [1,2]. The transcription factors of pluripotency Sox2 (SRY-box 2), Oct4 (Octamer-binding transcription factor 4), and Nanog are key in the transcriptional network that controls stem cell pluripotency and the induction of pluripotency in somatic cells [3][4][5]. Sox2 belongs to a large group of Sox family proteins first discovered in 1990 [6][7][8]. All Sox proteins have the high-mobility group (HMG) box domain that may mediate non-sequence-specific and sequence-specific DNA binding [9]. HMG proteins are ubiquitous and abundant nuclear proteins that bind to nucleosomes and cause structural changes in chromatin. These non-histone proteins perform a significant role in DNA replication, recombination, transcription, and DNA repair processes. Most Sox TFs bind to and regulate different sets of genes in different cellular contexts. For example, Sox2 participates in a stunningly diverse class of cells and tissue types, including pluripotent stem cells, lung tissue, nerve lines, ear, and eye [10,11].
Among DNA-binding transcription factors, the POU genes represent a large group and play a fundamental role in cell-type specification and developmental regulation. The abbreviation POU originates from the names of three mammalian transcription factors, the pituitary-specific Pit-1, the octamer-binding proteins Oct-1 and Oct-2, and the neural Unc-86 from Caenorhabditis elegans [12]. Several detailed reviews have discussed DNA-binding by POU TFs and their function in mammalian Data 2020, 5, 111 2 of 9 development [12][13][14]. Despite the fact that the domain is widespread across various species of living organisms, the sequence of this domain is nevertheless strictly conservative [15]. POU genes consist of the following three parts: A N-terminal POU-specific domain (POU'S'), a C-terminal homeodomain (POU'HD'), and a linker region between the two. OCT4 is a member of the octamer-binding subgroup of the POU family of transcription factors [16,17], which binds to the octamer motif (ATGCAAAT consensus sequence) using a bipartite DNA-binding POU domain.
It was previously described that Oct4 and Sox2 cooperatively control the pluripotent-specific expression of several genes by binding together with the cis-regulatory element Oct-Sox, and thereby regulate the transcription of important target genes, such as Fgf4, Utf1, Pou5f1(Oct4), Nanog, and Sox2 [11,18]. The binding sites of the DNA-associated pluripotency transcription factors were identified by the ChIPseq technique applied genome-wide for mapping TF binding regions in living cells [19,20]. The method combines chromatin immunoprecipitation (ChIP) using TF-specific antibodies with high-throughput next-generation parallel sequencing [21,22]. Along with this typical ChIP analysis, modern computational methods are also becoming increasingly important for genome-wide studies of protein-DNA interactions.
Nucleotide sequences outside of coding regions tend to be less conserved among organisms unless they are important for function, that is, where they are involved in the regulation of gene expression. Thus, the discovery of motifs in protein and nucleotide sequences can lead to the definition of function and clarification of the evolutionary relationship between sequences. In this work, we describe a novel approach, the DNA sequences profiling of human genome or other organisms, with the examples of Utf1-like, Fgf4-like and Hoxb1-like motifs, which are cis-regulatory elements for pluripotency transcription factors Sox2 and Oct4. We present validation results and provide examples that demonstrate its application in phylogenetic analyses between different species.
Genome-wide analyses of these motifs in a single species have been previously conducted to estimate common motifs, evolution, patterns of expression, and predicted localization. Multi-genome analyses of Utf1-like, Fgf4-like, and Hoxb1-like motifs resulted in a more rigorous and consistent description, and provided insight into their evolution. However, there is currently a lack of targeted studies investigating this question across multiple species. To aid in furthering our understanding of these regulatory elements, we performed multiple bioinformatics analyses to rigorously define the conserved motifs across multiple species, and to examine how they are spread throughout mammalian and other species.

Bioinformatic Identification of Sox2/Oct4 Motif Sequences
To analyze the frequencies of the Sox2/Oct4 motif sequences, we used FastPCR software (PrimerDigital, Helsinki, Finland) [23,24] with the "Restriction" tool that allowed custom degenerated sequences to search the genomes of various organisms. For this, the flag "-2" was used, which allows for collecting all cases of complementary coincidences both for one chromosome and in total for the entire genome. All other parameters were left at their default values. The program performed a forward and complementary search for each sequence motif. For each degenerated sequence motif, a table was generated indicating a specific sequence in the genome and its frequency.
A list of the screened genomes that were searched is presented in Table S2 and S3. The use of Sox2/Oct4 motif sequences allowed the assessment of how the software handles complex tasks, such as in silico degenerate pattern searching. The sequence motif lengths can be at least 4 nucleotides. The analysis results include a table and are presented separately for each sequence motif and its frequency. The program "decodes" the degenerated sequence motif and presents the frequencies for each of them. For each genome under study, tables were obtained in a tab-delimited format, and were combined and sorted by text.
Additionally, neutral sequences for control analysis used random sequences, for example, inverted Utf1-like sequence (ATGYWDGDnHWTTSW).
The resulting Sox2/Oct4 motif sequence was used as the search query with FastPCR software against all studied genome sequences (listed in Table S2) using the flag "-2" with all other settings at their default values. This resulting sequence dataset (Table S3) was used for all subsequent analyses.

A Phylogenetic Tree of Sox2/Oct4 Motif Sequences
The Sox2/Oct4 motifs discovered by FastPCR software [23,24] were used as input to define the phylogeny of studied genomes. To determine the number of sites and the average distance between them, an in-house script was developed to analyze the FastPCR results. Likewise, an in-house script was used to compute the Nei's standard genetic distance [25] between each species based on a comparison of the frequencies of each motif between species: where J xy is calculated as the sum of shared frequencies of a motif sequence by genome x and y normalized for each genome. Large and small genomes will have different values for each motif; therefore, the compared frequencies of the motifs were calculated for each specific genome. The J xy value ranges from 0 to 1, depending on the degree of frequency coincidence between two unrelated genomes. The J x or J y values correspond to the number of motifs for genome x and y.
A phylogenetic tree of all the species used in this study was created using the MEGA X software (Pennsylvania State University, USA) [26].

Results and Discussion
Based on results from the proximity utilizing the biotinylation (PUB) method in a living cell, we previously discovered a high level of biotin labeling of transcription factors Sox2 and Oct4 in comparison with various control proteins [27]. The Sox2 protein binds to the CATTGTT sequence, while Oct4 recognizes the octameric consensus ATGCTAGT sequence [18,[28][29][30][31][32][33][34][35][36][37][38][39], such as in undifferentiated embryonic cell transcription factor 1 (Utf1) or other motifs (Supplementary material Figure S1). Usually, these recognizing DNA sites are linked together to form a composite motif, known also as the "canonical" motif. Thus, the interaction between these key transcription factors of pluripotency proceeds via the DNA-binding domains HMG, POU'S', and POU'HD'.
An analysis of the literature on the distribution of motif variants in genomes showed the presence of some differences in the sequences. For example, in the mouse genome, there are two variants of the motifs Fgf4 (1 and 2 points of Table 1) and Utf1 (6 and 7 points of Table 1). For both motifs, matches were found in the genomic database Ensembl (Supplementary Table S2). It is interesting to note that other sequences in which there are combinatorial replacements of the Oct4 moiety in the Utf1 motif ATGCTAGT with another ATGCTAGA are also present in the mouse genome. Therefore, we searched for DNA sequences using generalized formulas. To generate variants of the Sox2/Oct4 motifs, we used the FastPCR software [23,24] with the "Restriction" tool that allowed custom degenerated sequences to search in the genomes of various organisms. According to the IUPAC nucleotide nomenclature rules, the following generalized motif sequences as described in the manuscript of Tapia et al. [28] were used: Fgf4-like (HWTTSWnnnnATGYWDWD), Utf1-like (HWTTSWnATGYWDGD), and Hoxb1-like (HWTTSWnATGYWDWD). The largest numbers of sites were found for the Fgf4-like and Hoxb1-like motifs (Supplementary material Table S1).
Determining the molecular genetic relationships of organisms can be performed on the common features for all organisms. As a rule, for this purpose, universal genes characteristics of living organisms are chosen, such as ribosomal RNA genes.
In this study, we suggested that for genetically related species, there should also be a general trend for evolutionarily ancient promoter regions, such as Utf1-like, Fgf4-like, and Hoxb1-like canonic variant motifs. To test this, we analyzed the genome-wide sequences of some closely related and distant mammalian species, and some representatives of marsupials, birds, amphibians, insects, and plants.
Since the studied sequences are characterized as promoter regions only for mammals, the genetic relationship should be well traced precisely among animal species. In the plant kingdom and for insects, these sequences have a different evolutionary nature.
To determine the distance coefficient between the compared species according to Utf1-like, Fgf4-like, and Hoxb1-like canonic variant motifs, we proceeded from the following. For related species, the amount of each competitive sequence must be similar given the size of the genome. The larger the genome size, the more variants of a particular sequence will be observed for each canonic variant motif. The closer the genomes are, the more similarity there will be in the frequency and sequence quality for each canonic motif. The total distance is defined as the sum of all coefficients for all frequencies of each sequence in each canonic motif. To calculate such distances, we used Nei's standard genetic distance, considering the coincidence of individual sites and their frequencies, and normalized to the total number of sites per genome for the species being compared.
Next, we examined the distribution of Utf1-like, Fgf4-like, and Hoxb1-like canonic variant motifs for various members of the animal kingdom.
For the Hoxb1-like canonic variant motif (HWTTSWnATGYWDWD), a maximum of 107,723 sites were identified in the Mus musculus genome and a minimum of 4304 for the insect Nasonia vitripennis. For the Sciurus vulgaris and Mus musculus genomes, the largest number of variants of this motif was identified (6910 unique sequences). The plant genomes were also characterized by a low number (HWTTSWnnnnATGYWDWD) revealed the largest variants of this motif (192,940 sites and 114,914 unique variants for the human genome). The minimum number of this site in the insect Nasonia vitripennis was 4507, with 4317 unique variants. Phylogenetic analysis was performed by the UPGMA method, which most accurately characterizes the genetic relationship of the studied species. In general, the phylogenetic analysis for this site overall coincides with what we already observed for the Hoxb1-like and Utf1-like sites; some changes concerning the position of Gallus gallus were isolated from mammals' genomes, which better corresponds to the phylogeny of these species.
To examine the distribution of the Sox2/Oct4 motif sequences among mammals in more detail, we made a simplified grouping of the studied species. Because these clades consist of a different number of species, all of the population fractions were reweighted to normalize the results so as to facilitate comparison.

Conclusions
In this work, we performed a genome-wide analysis of the cis-regulatory elements and promoters for several mammals and distantly related species, including animals such as insects and amphibians and some plant species. We also studied the extent to which these data are common among these eukaryotes in accordance with genome size and evolutionary relationship between these species. Our hypothesis was that closely related species should have a similar frequency of occurrence of these sequences in relation to the genome size. If these sequences are evolutionarily neutral, then we can trace them in a variety of species that are evolutionarily distant from mammals. In the case of the evolutionary significance of these sequences, we could observe the absence of any connection between the frequencies of these sequences and the phylogeny of the compared species. The frequency trend of these sites is in good agreement with the relatedness of these species. For very closely related species, these values practically coincide.
An analysis of the occurrence frequency of the Utf1-like canonic variant motif (HWTTSWnATGYWDGD) site for the studied species showed that the minimum number of this site in the insect Nasonia vitripennis is 1077, and the maximum in humans is 83,559. For the Sciurus vulgaris genome, the largest number of variants of this motif was revealed (3451 unique sequences for the human genome). The phylogenetic analysis for this site generally coincides with what we observed already for the Hoxb1-like site, with some changes regarding the position of Gallus gallus among mammals. This is an interesting fact that is associated with the high number of unique sequences of this motif in the genome of Gallus gallus (3287), with a low number of these sites per genome (16,515).
An analysis of the frequency of occurrence of the site Fgf4-like canonic variant motif (HWTTSWnnnnATGYWDWD) revealed the largest variants of this motif (192,940 sites and 114,914 unique variants for the human genome). The minimum number of this site in the insect Nasonia vitripennis was 4507, with 4317 unique variants. Phylogenetic analysis was performed by the UPGMA method, which most accurately characterizes the genetic relationship of the studied species. In general, the phylogenetic analysis for this site overall coincides with what we already observed for the Hoxb1-like and Utf1-like sites; some changes concerning the position of Gallus gallus were isolated from mammals' genomes, which better corresponds to the phylogeny of these species.
To examine the distribution of the Sox2/Oct4 motif sequences among mammals in more detail, we made a simplified grouping of the studied species. Because these clades consist of a different Data 2020, 5, 111 6 of 9 number of species, all of the population fractions were reweighted to normalize the results so as to facilitate comparison.

Conclusions
In this work, we performed a genome-wide analysis of the cis-regulatory elements and promoters for several mammals and distantly related species, including animals such as insects and amphibians and some plant species. We also studied the extent to which these data are common among these eukaryotes in accordance with genome size and evolutionary relationship between these species. Our hypothesis was that closely related species should have a similar frequency of occurrence of these sequences in relation to the genome size. If these sequences are evolutionarily neutral, then we can trace them in a variety of species that are evolutionarily distant from mammals. In the case of the evolutionary significance of these sequences, we could observe the absence of any connection between the frequencies of these sequences and the phylogeny of the compared species. To perform the analysis, we wrote a script and used FastPCR programs to search for these sequences genome-wide. We selected a mathematical apparatus for calculating genetic distances for the data used (a list of sequences, and their frequencies in relation to the size of the genome).
The results of this work indicate that the genome-wide analysis of the frequencies of cis-regulatory elements is shown to be neutral. That is, we can apply these data for phylogenetic analysis. Thus, we have shown that our assumption in the analysis of the frequencies of certain sequences, including sequences of regulatory elements and promoters, can be used to calculate evolutionary distances and construct a phylogenetic tree. Although the investigated sequences of regulatory elements and promoters have a functional role and should be subjected to selection, we observed that these sequences are evolutionarily neutral and applicable for revealing the relationship of the compared genomes.
Author Contributions: A.K. collected and analyzed data, wrote the manuscript, managed the project, and acquired financial support for the project leading to this publication; R.K. analyzed data using FastPCR software, wrote the manuscript, and provided critical reviews and commentary to the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by a grant from the Ministry of education and science of the Republic of Kazakhstan (AP05132131). Open access funding was provided by University of Helsinki including Helsinki University Central Hospital. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Acknowledgments:
We warmly thank Alexander Bolshoy (Palacký University Olomouc, Czech Republic) for critical reading of the manuscript.

Conflicts of Interest:
The authors declare that they have no conflict of interest. Undifferentiated embryonic cell transcription factor 1