Molecular Evolution and Inheritance Pattern of Sox Gene Family among Bovidae

Sox genes are an evolutionarily conserved family of transcription factors that play important roles in cellular differentiation and numerous complex developmental processes. In vertebrates, Sox proteins are required for cell fate decisions, morphogenesis, and the control of self-renewal in embryonic and adult stem cells. The Sox gene family has been well-studied in multiple species including humans but there has been scanty or no research into Bovidae. In this study, we conducted a detailed evolutionary analysis of this gene family in Bovidae, including their physicochemical properties, biological functions, and patterns of inheritance. We performed a genome-wide cataloguing procedure to explore the Sox gene family using multiple bioinformatics tools. Our analysis revealed a significant inheritance pattern including conserved motifs that are critical to the ability of Sox proteins to interact with the regulatory regions of target genes and orchestrate multiple developmental and physiological processes. Importantly, we report an important conserved motif, EFDQYL/ELDQYL, found in the SoxE and SoxF groups but not in other Sox groups. Further analysis revealed that this motif sequence accounts for the binding and transactivation potential of Sox proteins. The degree of protein–protein interaction showed significant interactions among Sox genes and related genes implicated in embryonic development and the regulation of cell differentiation. We conclude that the Sox gene family uniquely evolved in Bovidae, with a few exhibiting important motifs that drive several developmental and physiological processes.


Introduction
The Sox genes belong to a large group of genes whose DNA binding domain is called a high mobility group (HMG), encoding diverse and well-conserved transcription factors, and consisting of 20 genes in vertebrates and only a handful in invertebrates [1,2]. Sox genes share homology with the HMG box of the sex-determining region Y (Sry) gene, a founding member of the Sox gene family, which is required to specify the male phenotype. This group of genes owes their acronym to sharing 46% or more identity to the Sry gene in the HMG box [3].
Vertebrate Sox genes have been divided into eight groups (A-H) based on conserved protein domains and similarity in their amino acid sequences [2,4]. Members of the same group are highly similar to each other within and outside the HMG box, while those of different groups share a lower degree of identity in the HMG box and no significant identity outside the domain [3]. Most members of this group are scattered throughout the genome, with the exception of Sry and Sox3, which are located on sex chromosomes [2].
Sox proteins control transcription in several ways; for example, the protein domain connects the minor groove of DNA to AACAAAG, AACAAT, and related sequences and induces a sharp bend in DNA, thereby allowing the protein to play an important structural role in the assembly of transcription enhancer complexes [3]. They exhibit their structural role by shaping the regulatory regions and establishing physical contacts between transcription factors bound to the same target gene promoter or enhancer [2]. Thus, the regulatory functions of Sox proteins require the cooperation of interacting partners that bind DNA and enable the specific selection of target genes. This type of cooperation is dynamic, allowing Sox proteins to regulate different events by changing partner factors-a key factor that drives the progression of developmental processes [5].
All Sox genes have a specific expression pattern and mostly play important roles in the determination of cells fate and the differentiation of cells into specific lineages, such as embryonic stem cells, neuronal and glial cells, Sertoli cells, chondrocytes, etc. [3]. They have also been shown to be capable of reprogramming differentiated somatic cells into pluripotent stem cells [6,7], activating genes that are important for maintaining the pluripotent cell state [8], and inducing the delta crystallin gene in the lens [9].
The functions of the members of the Sox B1 (Sox) family overlap, although each has a function during the migration of neuronal precursors in the ganglionic eminence [10]. Sox2 and Sox3 deficiency cause pituitary defects [11]. Single knockouts of Sox5 or Sox6 in mice lead to some skeletal abnormalities, and a Sox5 and Sox6 double knockout leads to a lack of chondrogenesis [12]. Mutations in Sox18 induce severe cardiovascular and hair follicle defects including lymphatic dysfunction in developing embryos [13,14]. Whereas mutational disruptions in Sox9 lead to the failure of sertotic placode invagination [15]. Members of the SoxF subfamily exhibit overlapping functions [16] and are expressed in endothelial cells, while their mutation has been reported in lymphatic defects [17]. Kamai-Azuma et al. [18] also reported defects in gut tube formation in Sox17 knockout mice. Members of the SoxC subfamily (Sox4, 11, and 12) are expressed in neural and mesenchymal progenitor cells. Mutations in these subfamilies lead to the malformation of various organs. Sox4 and Sox11 knockout mice exhibit multiple organ defects [19]. Several studies aimed at discovering more Sox protein functions have been impaired by extensive functional redundancy among members of the same group and pleiotropy [14].
The Bovidae family includes several economically and socially important ruminant species such as cattle, sheep, goats, and antelope, comprising more than 140 recognized species distributed among 49 genera. The evolution of the different members of this family has been molded by several mechanisms including temperature adaptations, feeding ecology, vegetation physiognomy, climate fluctuations, immigration, adaptive radiations, and mass extinctions [20]. There are published reports examining the evolutionary history of Bovidae through morphological data, allozymes, serum immunology, DNA sequences, and mitochondrial DNA analyses [21]. Protein and DNA-based sequence methods utilize sequence variations and the analysis of conserved domains, in addition to computational methods utilizing genetic variation within and between amino acid sequences to predict functional and structural outcomes [22][23][24][25].
Most computational analyses on Sox genes have focused on individual genes and their regulatory functions rather than the entire gene family. Despite the abundance of information on the importance of Sox proteins in various developmental processes in other species, we found no published report on the Sox gene family in Bovidae. Taking this into consideration along with the fact that Sox genes are significant players in the regulation of developmental processes, we examined the evolutionary dynamics of the Sox gene family within Bovidae by evaluating their biochemical properties, structural prediction, conserved domains, protein-protein interaction, and phylogenetic relationships. Sequence alignment was performed using the ClustalW algorithm on the MEGAX software [26]. The alignment was imported to itol for phylogenetic tree construction using Maximum Likelihood Tree method, as described in [27]. Thereafter, MEME (Multiple Em for Motif Elicitation http://alternate.meme-suite.org accessed on 12 June 2022) version 5.0.2 [28] was used to predict the conserved motif structures encoded across all Bos Sox genes. All sequences were scanned for the presence of the sequence motif RPMNAFMVW, which has been reported to be conserved for all Sox sequences other than Sry and only sequences that have the motif were included in further analysis. We used Multalin software (http: //multalin.toulouse.inra.fr/multalin accessed on 24 April 2022) to identify the conserved amino acid sequences among the species. The sequence logo of the identified domain in the Sox protein family was constructed with WebLogo (http://weblogo.berkeley.edu/logo.cgi, accessed on 24 April 2022).

Physicochemical Characterization of Sox Gene Family
To decipher the physicochemical properties of Sox gene family in Bovidae, we used Expasy ProtParam server (https://web.expasy.org/protparam accessed on 21 June 2022) to compute the molecular weight, isoelectric point (pI), instability index, aliphatic index, and grand average of hydropathicity (GRAVY) of all the proteins in the Bovidae Sox family, as previously described [23]. The chromosomal location of each Sox gene was retrieved from Unitprot/KB/Swiss-Prot database in NCBI (https://www.ncbi.nlm.nih.gov accessed on 10 April 2022).

Identification of Interacting Proteins, Functional Enrichment, and Pathway Analysis
Protein-protein interactions were predicted using the Search Tool for the Retrieval of Interacting Genes database (STRING https://stringdb.org accessed on 7 June 2022). This is important to elucidate the association of Sox proteins with other molecules. Functional enrichment analysis was performed with PANTHER (http://www.pantherdb.org accessed on 5 June 2022) [29] and Database for Annotation, Visualization, and Integrated Discovery (DAVID) to classify the genes according to their function, annotated with ontology terms: biological processes, cellular components, and molecular functions of the studied genes.

Evolutionary Phylogenetic Analysis of Sox Gene Family
To study the evolutionary pattern and inheritance of Sox genes in Bos sp., a multiple sequence alignment was performed using ClustalW tool in MEGA11 [26] with default parameters; all positions containing gaps and missing data were eliminated. The phylogenetic trees of Sox gene family in Bos sp. were constructed by adopting the Neighbor-joining method of MEGA11 software with the following parameters: Poisson correction, pairwise deletion, and 1000 bootstrap replicates. The constructed tree files were visualized using itol (Interactive Tree of Life, https://itol.embl.de accessed on 3 June 2022). itol (Interactive Tree of Life, https://itol.embl.de 3 June 2022).

Physicochemical Properties of Sox Genes
Our analysis of the physicochemical properties revealed that Sox15 h molecular weight of 25,140 Da, while the SoxD family [5,6,13] had the high weight values ( Table 1). The pI is the pH at which a particular molecule electric charge; this value is useful for understanding protein charge stabilit pI ranged from 4.95 to 9.85 ( Table 1). The results for the instability index r genes were all unstable (II > 40), with the lowest value of 41.96 in Sox1 and 80.78 in Sox9. The aliphatic index, regarded as the factor determining the th of globular proteins, was lowest in Sox1 (44.44) and highest in Sox13 (71.8 tinction coefficient and low negative GRAVY values were also observed.

Physicochemical Properties of Sox Genes
Our analysis of the physicochemical properties revealed that Sox15 had the lowest molecular weight of 25,140 Da, while the SoxD family [5,6,13] had the highest molecular weight values ( Table 1). The pI is the pH at which a particular molecule carries no net electric charge; this value is useful for understanding protein charge stability. Overall, the pI ranged from 4.95 to 9.85 ( Table 1). The results for the instability index reveal that the genes were all unstable (II > 40), with the lowest value of 41.96 in Sox1 and the highest of 80.78 in Sox9. The aliphatic index, regarded as the factor determining the thermostability of globular proteins, was lowest in Sox1 (44.44) and highest in Sox13 (71.88). A high extinction coefficient and low negative GRAVY values were also observed.

Evolutionary Dynamics and Phylogenetic Analysis
Furthermore, this study examined the evolutionary pattern of 31 Sox sequences of the Bos sp. The multiple sequence alignment indicated that Lysine (K), Proline (P), Arginine R), Glycine (G), and Leucine (L) are conserved in the Sox family ( Figure 2). The sequences demonstrated significant variability in both percentage identity and similarity in Bos despite their common evolutionary origin. The analysis of the percentage identity and similarity revealed that Sox14 and 21 (71.73%) were highly similar while Sox7 and 30 exhibited the lowest similarity (15.17%) ( Table 2 and Supplementary Table S1) Genes 2022, 13, 1783 7 of 16

Conserved Motif Analysis of Sox Genes
To further characterize the evolutionary pattern of the Sox gene in the Bos species, a conserved motif analysis was performed using the MEME and MEGA programs. The results indicate that 10 conserved motifs are present in the Bos species. Interestingly, we found that conserved motifs 1 and 2 (M1 and M2) exist in all members of the Sox family except Sox17, which had motif 1 only. The other nine motifs were present in the different Sox groups. M1 and M2 are the core HMG boxes, containing 79 amino acid residues. As shown, five motifs are present in SoxD-sox13 and SoxE:Sox9 and 10, and seven motifs in SoxD:Sox5 (Figure 6a). Four motifs exist in SoxE:Sox8 and three motifs in SoxH:Sox30. Interestingly, we found an importantly conserved motif, namely, EFDQYL/ELDQYL, in both SoxE and F including Sox 7-10 and 18 with the exception of Sox 17 (Figure 6b). This motif is crucial for the transactivation and regulation of these protein family and has also been observed in humans. arrangement, with SoxB1 diverging first. The result is further represented in an unrooted tree, with the expected clustering pattern observed for all groups (Figure 3b). Figure 4 shows the MSA of the homology of the HMG domain across the Sox genes showing that Lysine (K), Proline (P), Glutamate (E), Arginine (R), and Leucine (L) are 100% conserved in this region. The sequence logo showing the relative frequencies of each of the conserved domains and their respective positions is displayed ( Figure 5).

Conserved Motif Analysis of Sox Genes
To further characterize the evolutionary pattern of the Sox gene in the Bos species, a conserved motif analysis was performed using the MEME and MEGA programs. The results indicate that 10 conserved motifs are present in the Bos species. Interestingly, we found that conserved motifs 1 and 2 (M1 and M2) exist in all members of the Sox family except Sox17, which had motif 1 only. The other nine motifs were present in the different Sox groups. M1 and M2 are the core HMG boxes, containing 79 amino acid residues. As shown, five motifs are present in SoxD-sox13 and SoxE:Sox9 and 10, and seven motifs in SoxD:Sox5 (Figure 6a). Four motifs exist in SoxE:Sox8 and three motifs in SoxH:Sox30. Interestingly, we found an importantly conserved motif, namely, EFDQYL/ELDQYL, in both SoxE and F including Sox 7-10 and 18 with the exception of Sox 17 (Figure 6b). This motif is crucial for the transactivation and regulation of these protein family and has also been observed in humans.

Characterization of Functional Motifs
All the Sox sequences analyzed were individually scanned for matches against the PROSITE collection of protein signature databases. We found two main domains, HMG box B and the SoxC terminal domains (Figure 7), with varying frequency across the Sox gene family in Bovidae. We found one proline-rich domain in Sox9, 17, 18, 15, and 30. The amino acid composition analysis reveals that these proteins have the highest proline concentration among the studied Sox proteins (Sox9- 16

Biological, Molecular, and Cellular Function of Bos Sox Genes
For the Gene Ontology analysis, three criteria were utilized: the biological processes the genes were involved in, the cellular components they are a part of, and their molecular function. We identified 49 biological processes with a false discovery rate <100, 6 cellular components, and 19 molecular functions (Table S3, Figure 8). The DAVID analysis revealed 26 biological processes, 2 cellular components, and 7 molecular functions (Supplementary Table S4). Nineteen significant biological processes including stem cell fate specification, the positive regulation of mesenchymal cells, renal vesicle induction, retinal rod cell differentiation, metanephric nephron tubule formation, the negative regulation of myoblast differentiation, and the positive regulation of chondrocyte differentiation were concomitantly identified from both databases (Supplentary Tables S3 and S4, Figure 8).

Biological, Molecular, and Cellular Function of Bos Sox Genes
For the Gene Ontology analysis, three criteria were utilized: the biological processes the genes were involved in, the cellular components they are a part of, and their molecular function. We identified 49 biological processes with a false discovery rate <100, 6 cellular components, and 19 molecular functions (Table S3, Figure 8). The DAVID analysis revealed 26 biological processes, 2 cellular components, and 7 molecular functions (Supplementary  Table S4). Nineteen significant biological processes including stem cell fate specification, the positive regulation of mesenchymal cells, renal vesicle induction, retinal rod cell differentiation, metanephric nephron tubule formation, the negative regulation of myoblast differentiation, and the positive regulation of chondrocyte differentiation were concomitantly identified from both databases (Supplentary Tables S3 and S4, Figure 8).

Protein-Protein Interaction Cluster with Sox Genes
In order to analyze the protein-protein interaction (PPI), co-expression, co-regulation, and physical association, we used STRING to build the protein network. Using k-means clustering we generated three clusters composed of closely connected interactions. The results revealed a major PPI network cluster with all the studied genes except Sox12, 15, and 30, which exhibited no interaction (Figure 9a). Sox5 had the most interaction with 10 nodes while Sox11, 13, and 14 had the least interaction. However, Sox1, 2, and 3 were not found in the cluster.

Protein-Protein Interaction Cluster with Sox Genes
In order to analyze the protein-protein interaction (PPI), co-expression, co-regulation, and physical association, we used STRING to build the protein network. Using kmeans clustering we generated three clusters composed of closely connected interactions. The results revealed a major PPI network cluster with all the studied genes except Sox12, 15, and 30, which exhibited no interaction (Figure 9a). Sox5 had the most interaction with 10 nodes while Sox11, 13, and 14 had the least interaction. However, Sox1, 2, and 3 were not found in the cluster.
The PPI also indicated a possible involvement in pathways such as metabolic, developmental, and regulatory. Cellular signaling pathways such as the regulation of the Wnt signaling pathway, the negative regulation of the canonical Wnt signaling pathway, the regulation of the canonical Wnt signaling pathway, and the signal transduction involved in cell cycle checkpoint were associated with these protein networks. The STRING-extended analysis revealed the interaction of specific Sox genes with CTNNB1, HERC1, LEF1, PAX3, and TCF7L1 (Figure 9b).

Discussion
The Sox family of transcription factors exhibit diverse tissue-specific expression patterns during early embryonic development and plays important roles in cell fate [30]. To provide a deeper understanding of the inheritance pattern and the role of Sox protein in The PPI also indicated a possible involvement in pathways such as metabolic, developmental, and regulatory. Cellular signaling pathways such as the regulation of the Wnt signaling pathway, the negative regulation of the canonical Wnt signaling pathway, the regulation of the canonical Wnt signaling pathway, and the signal transduction involved in cell cycle checkpoint were associated with these protein networks. The STRING-extended analysis revealed the interaction of specific Sox genes with CTNNB1, HERC1, LEF1, PAX3, and TCF7L1 (Figure 9b).

Discussion
The Sox family of transcription factors exhibit diverse tissue-specific expression patterns during early embryonic development and plays important roles in cell fate [30]. To provide a deeper understanding of the inheritance pattern and the role of Sox protein in Bovidae evolution, we utilized sequences available in public databases to perform several computational analyses, yielding evolutionary relationships among the Sox gene family in Bos. Our analysis identified nine distinct clades, which were assigned to the known Sox groups A-H [31,32]. The genes within the same subgroup shared greater sequence similarity outside the HMG domain than those that are more related, a finding consistent with a previous report [33]. We observed high percentages of proline, glycine, and alanine in all the Sox proteins. Studies have shown that proline readily adopts a cis and trans configuration in response to subtle influences, induces sharp turns in the local geometry, is important in protein-protein interaction and cell adhesion, and mediates signal transduction [34]. The SoxG and SoxD subfamilies had the lowest and highest amino acid residues and molecular weights, respectively. All the Sox proteins had negative GRAVY values, indicating the hydrophilic nature of the proteins, as previously described [31,35], enhancing their binding capacity [36,37] SoxA (SRY) protein, a transcriptional regulator that controls the genetic switch in male development, is the only member of Group A and formed a monophyletic group in the phylogenetic analysis. Its expression inhibition causes defects in the development of testes [32,38], with cytogenetic and molecular studies revealing that it possibly arose from a mutation of Sox3 and the fusion of the gene with regulatory sequences from another gene already on the X chromosome [39]. STRING pathway analysis revealed its association with Sox9, with GO enrichment confirming that SRY and Sox9 are enriched in male sex determination, which is consistent with published reports [31,40]. SRY binds to the TESCO (testis-specific enhancer core) sequence of Sox9 and initiates the differentiation of somatic precursors into Sertoli cells that coordinate the gonadal development into the testis [41]. Despite the substantial variation in expression profile structure and amino acid sequences within mammals, the function of SRY to activate Sox9 during development appears to be conserved.
The Sox B 1 group comprises Sox1, 2, and 3, which are groups of transcription factors that have been reported to affect the Wnt β-catenin signaling pathway [2]. Sox1 inhibits β-catenin/TCF transcription activity by binding to β-catenin via their C-terminal region while Sox2 overexpression down-regulates the Wnt β-catenin signaling pathway suggesting a negative feedback loop between them. These proteins also promote the self-renewal of neural progenitor cells. PANTHER analysis shows that Sox3 interacts with SRY and Sox9 in sex determination. There was no interaction reported for Sox1 and 2, possibly attributed to a functional redundancy in this subfamily [42,43] The proteins encoded by members of SoxB 2 subfamily, Sox14 and Sox21, are very similar to each other but different from subfamily B 1 in areas outside the HMG domain. They have been implicated in the regulation of mesenchymal stem cell differentiation. Sox21 has been reported to promote neurogenesis in mice while Sox14 mediates terminal differentiation of the neural progenitor cells [44]. The STRING pathway analysis identified Sox14 as one of the genes involved in the regulation of neurogenesis while Sox21 was enriched in stem cell differentiation. The SoxC subfamily (Sox4, 11, and 12) has three members in most vertebrates [3]. Sox4 and 11 share a similar function and structural properties. Reports on Sox12 functional properties are scarce. Sox4 and 11 were shown to be involved in lymphocyte differentiation, osteoblast development, neural and glial cell development, control progenitor development, and the capacity to crosstalk with other cells to grow and mature skeletal structures [1,3]. Phylogenetic analysis revealed that the three proteins are closely related, as they all appeared in one clade. PROSITE revealed a SoxC-rich domain in all SoxF genes, suggesting that the complex interacts with this group. The HMG box domain in this group is virtually identical; therefore, the difference in DNA binding efficiency may be due to sequence differences outside the HMG box [3]. The STRING pathway analysis identified Sox4 and Sox11 as enriched in cardiac ventricle formation and limb bud formation, enteric nervous system development, and glial cell proliferation.
The SoxD subfamily (Sox5, 6, and 13) have higher amounts of amino acid residues than other Sox proteins. Our result showed that this subgroup is significantly enriched in oligodendrocyte differentiation, cell fate commitment, and glial cell differentiation. Members of this group formed a paraphyletic group with Sox30 suggesting a recent divergence. These genes interact with Sox9 of the SoxE subfamily, and expression of the three Sox genes culminates in growth plate proliferation in chondrocytes. Furthermore, the co-inactivation of Sox5 and 6 genes was reported to result in stunted or underdeveloped growth plates and articular cartilage, while Sox9 is required to turn on and maintain chondrocyte-specific genes [45,46].
SoxE proteins, Sox8, 9, and 10, overlap in expression, play critical roles in various biological processes, and are renowned for their role in transactivation. Haseeb and Lefebvre [47] reported a study in humans wherein SoxE and F proteins possess the unique motif EFDQYL/ELDQYL required for transactivation, which we also found in both SoxE and F of the Bovidae family. Based on its amino acid composition, it is functionally crucial and highly conserved in various vertebrate and invertebrate species [48]. The residues within the motif revealed the presence of acidic and hydrophilic amino acids (E, Q) alternating with hydrophobic residues (F, Y, L), which fold into pocket-like structures involved in recognizing and binding to functional partners and coactivators [47]. Other studies also showed that the sequence is required for the interaction of Sox9 and β-catenin, [49] The SoxF subfamily promotes the fate specification of endothelial cells [50]. They are thought to share a recent common ancestor with SoxE and exhibit a substantial overlap in expression function and regulation implying a redundant type of cooperation. The only Sox member in the G subfamily is Sox15. This Sox protein is found only in mammals and shares a common ancestry with SoxB, forming a paraphyletic tree. [14,51,52]. This common ancestry likely accounts for the ability of Sox15 to replace the function of Sox2 in the self-renewal of mouse stem cells [53,54] The STRING analysis revealed the complex interaction of the studied Sox proteins with CTNNB1, PAX3, HERC1, and TCF7L1. The catenin β-1 gene (CTNNB1) provides instructions for making the β-catenin protein, which is primarily found at junctions connecting neighboring cells. β-catenin plays a major role in sticking cells (adhesion), cell signaling, and in cell communication [55]. This pathway is important for Sox proteins to recognize target genes during cellular interactions. CTNNB1 formed complex interactions with Sox2,6,7 and 9 in several developmental processes in Coturnix japonica [31]. Furthermore, CTNNB1 also interacts with Sox17 and 30, LEF1, and TCF7L1 in β-catenin binding. LEF1 (lymphoid enhancer-binding factor 1) shares significant homology with HMG protein and is involved in the Wnt canonical pathway, cell differentiation, and follicle morphogenesis [56]. Pax3, Sox10, and c-Ret are components of a neural crest development pathway, and the interruption of this pathway at various stages results in neural crest-related human genetic syndromes [57].

Conclusions
Our study shows the detailed molecular evolution, inheritance pattern, and functions of Sox genes in the Bos family using several computational tools. In addition, this study documents the first comprehensive evidence of genetic variation in the Sox gene family in Bos sp. We reported a total of 20 Sox genes, which were classified into nine subfamilies based on phylogenetic analysis. We also provided evidence of divergent evolution in some Sox genes among the Bos family. This study further revealed important functional motifs that drive these proteins and enhance their ability to shape the regulatory regions of several genes and promote cells' fate and differentiation. Lastly, we found the motif EFDQYL/ELDQYL, which is required for the unique transactivation ability of SoxE and SoxF proteins. Based on the findings of this study, we have provided a suitable background for further studies to harness the potential of these protein families in immunotherapeutic and regenerative medicine targeted at Bovidae.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes13101783/s1, Table S1: Genetic Distance of Bos Sox gene subfamilies; Table S2: Amino acid composition of Sox genes in Bos sp; Table S3: Gene ontology: biological processes, molecular function and cellular components of Sox genes in Bos sp. (PANTHER); Table S4: Gene ontology: biological processes, molecular function and cellular components of Sox genes in Bos sp. (DAVID).