A Comparative Cross-Platform Analysis to Identify Potential Biomarker Genes for Evaluation of Teratozoospermia and Azoospermia

Male infertility is a global public health concern. Teratozoospermia is a qualitative anomaly of spermatozoa morphology, contributing significantly to male infertility, whereas azoospermia is the complete absence of spermatozoa in the ejaculate. Thus, there is a serious need for unveiling the common origin and/or connection between both of these diseases, if any. This study aims to identify common potential biomarker genes of these two diseases via an in silico approach using a meta-analysis of microarray data. In this study, a differential expression analysis of genes was performed on four publicly available RNA microarray datasets, two each from teratozoospermia (GSE6872 and GSE6967) and azoospermia (GSE145467 and GSE25518). From the analysis, 118 DEGs were found to be common to teratozoospermia and azoospermia, and, interestingly, sperm autoantigenic protein 17 (SPA17) was found to possess the highest fold change value among all the DEGs (9.471), while coiled-coil domain-containing 90B (CCDC90B) and coiled-coil domain-containing 91 (CCDC91) genes were found to be common among three of analyses, i.e., Network Analyst, ExAtlas, and GEO2R. This observation indicates that SPA17, CCDC90B, and CCDC91 genes might have significant roles to play as potential biomarkers for teratozoospermia and azoospermia. Thus, our study opens a new window of research in this area and can provide an important theoretical basis for the diagnosis and treatment of both these diseases.


Introduction
The worldwide decline in human semen quality has placed reproductive genetics at the forefront of scientific research on human reproduction and fertility. Male infertility is a combination of complex reproductive ailments with substantial genetic backgrounds [1]. It is characterized by the failure to achieve successful pregnancy after a year of unprotected power. Gene expression meta-analysis provides incipient biological insights and identifies more precise biomarkers and therapeutic targets [27]. The present study was conducted to identify differentially expressed genes (DEGs) associated with teratozoospermia and azoospermia by performing a meta-analysis on available microarray datasets to understand the common underlying molecular mechanisms. Further, in this clinical condition, we tried to find specific genes to understand the disease mechanism through a protein-protein interaction (PPI) network.

Microarray Data
The National Center for Biotechnology Information-Gene Expression Omnibus (NCBI-GEO) database (http://www.ncbi.nil.nih.gov/geo/, accessed on 1 July 2020) was used to collect suitable gene expression microarray samples [28]. A detailed search was conducted of the GEO database using the individual keywords "teratozoospermia" AND "azoospermia". Two datasets for teratozoospermia, i.e., GSE6872 and GSE6967, and two datasets for azoospermia, i.e., GSE145467 and GSE25518 were included in our study from the NCBI-GEO database, considering their fulfillment of certain criteria. The datasets which did not show any significant genes in GEO2R were excluded from our study. In the case of teratozoospermia, microarray was used for purified spermatozoa obtained from the ejaculate, while in the case of azoospermia, there is the complete absence of spermatozoa so tissues from testes were used for microarray. Testes tissues contain spermatogonial stem cells (SSCs) which produce spermatozoa through the process of spermatogenesis [29]; therefore, there exists a common origin between the cell type of teratozoospermia and azoospermia. The samples required for the study were collected from both healthy controls and patients. In the teratozoospermia study, controls can be defined as normal fertile males who have fathered at least one child, while in the azoospermia study, males with normal spermatogenesis processes can be considered as controls. The samples for the teratozoospermic study were collected from men aged between 21-57 years while in the case of the azoospermic study, the samples were collected from men of reproductive age. The subjects from whom the samples were collected for teratozoospermic and azoospermic study belonged to the American (from USA and Argentina) and European (from Slovenia) populations, respectively. The gene expression profiling was mainly based on abnormal spermatozoa and tissues of testes. The inclusion criteria that were used while choosing the datasets for meta-analyses are mentioned below: (i) the sample type should contain RNA only for both "teratozoospermia" and "azoospermia" datasets, (ii) datasets must not contain intersecting/duplicate data, (iii) datasets must not be generated from the same research laboratory, (iv) datasets must be heterogeneous in terms of microarray platform, and (v) each dataset must contain enough data to carry out a meta-analysis ( Table 1). The datasets that matched those inclusion criteria were selected for the present meta-analyses.  Azoospermia [32] GEO: Gene Expression Omnibus, GPL: GEO Platform, GSE: Genomic Spatial Event.

DEG Screening and Meta-Analyses
ExAtlas meta-analyses software was used to carry out the analysis of microarray expression data [33]. A total of four GEO datasets were included in the study and the expression profiles of those datasets were extracted using the GEO database. The quantile method was used for the standardization of the data [34]. The datasets were downloaded and saved individually and then merged using the batch normalization method. Genespecific batch normalization was used to combine two or more datasets. If two datasets included the same tissue or organ, then the median expression levels for the common tissue/organ were neutralized in the two datasets using this method.
ExAtlas and NIA Array Analysis have the same algorithm for statistical analysis [35]. Gene expression values were converted into a log form and used for the analysis of variance (ANOVA) [35], which was modified for multiple hypotheses testing cases. Moreover, the false discovery rate (FDR) [36] was used to evaluate the importance of gene expression changes in place of p-values. Thereafter, meta-analyses were carried out based on the saved datasets using the random effect method [37] and lists of DEGs were saved as a gene set file. The random effects method considers the variance of heterogeneity among different studies, which is added to the variance of individual effects. Here the term "effect" means the log ratio of gene expression change/difference compared to the control or study-wide mean or median [38].
The raw datasets were simultaneously analyzed with another software named Network Analyst 3.0 [35]. Upon combining the datasets after their standardization, 15,879 feature numbers were identified and then subjected to batch effect adjustment using Combat [34]. Meta-analyses were then performed on the combined dataset using a random effect model with the p-value set to less than 0.05 and FDR to less than or equal to 2. FDR can act as an effective indicator of the strength of a study and the p-value can be useful for statistical power analyses. It can also be used to examine thousands of features, such as all the genes of an organism, and measure their expression related to the above-mentioned diseases. The Limma package of R/Bioconductor was utilized for the recognition of DEGs [39].
In addition, gene expression analyses were performed on all the datasets individually using GEO2R [40]. Quantile standardization was carried out and Benjamini and Hochberg's false discovery rate method [41] was selected by default for GEO2R analysis since it is mostly used for the adjustment of microarray data and also provides a good balance between the discovery of statistically significant genes and limitation of false positives.

Comparative Analyses
A comparison of DEGs from both analyses was carried out and the common genes were marked. The marked genes have an annotation set to the official gene symbol which was then rectified using the db2dbtool of the Biological Database Network (BDN) [42]. In addition, GEO2R was used to generate the gene expression output of all those datasets for comparison [31]. The common DEGs were then marked and also compared with the output of ExAtlas and Network Analyst 3.0. A heatmap was then constructed using the DEGs with the help of the Complex Heatmap Package of R [43].

Protein-Protein Interaction (PPI) Network
DEGs were utilized to carry out the study of PPIs using the STRING application [44]. The protein network file was then opened with Cytoscape software to analyze the core module of the PPI network [11,45]. First (1) shell interactors represent the input proteins that were found to be common between ExAtlas and Network Analyst analyses. No second shell interactors were included in the analysis. The Network Analyzer function of Cytoscape was then used to analyze the generated network. In the network, the nodes represent the proteins, whereas the edges represent the evidence of interactions. The node size is directly proportional to the betweenness centrality value of the particular protein and the node color is based on the degree of connectivity of the different nodes with other neighboring nodes. Nodes with no degree of connectivity were not represented in the network. The difference in the color of nodes was due to their varying degree of connectivity. The highest degree of connectivity was found to be 14, whereas the lowest degree of connectivity was found to be 1.
Furthermore, scatterplots were constructed between the betweenness centrality and closeness centrality of the different nodes and also the betweenness centrality and degree values of the different nodes.

Pathway Enrichment Analyses
The BINGO application of Cytoscape was used to study the biological processes involved with DEGs and functional enrichment analysis [46]. The Benjamini and Hochberg FDR correction was employed to perform a hyper geometric test. For enrichment analyses, the full GO database was selected as the ontology file. The network generated was then analyzed using the network analyzer function of Cytoscape.
The overall workflow used in the study for the identification of potential biomarker genes common to teratozoospermia and azoospermia is represented in Figure 1. It is shown that the three analysis methods, i.e., ExAtlas, Network Analyst, and GEO2R were used for the meta-analyses of the genes where overlapping outputs were obtained. These overlapping outputs were then utilized to study the PPI network using the STRING application and pathway enrichment analyses using the BINGO application ( Figure 1).
Genes 2022, 13, x FOR PEER REVIEW 6 of 24 Figure 1. The diagrammatic illustration showing the workflow for the identification of potential biomarker genes common to teratozoospermia and azoospermia, introducing an in silico approach. Meta-analysis using three different analyses, i.e., ExAtlas, Network Analyst, and GEO2R, resulted in overlapping outputs which were then used to study protein-protein interaction (PPI) and protein enrichment analysis using the STRING and BINGO application, respectively.

Results
Four microarray datasets named GSE6872, GSE6967, GSE145467, and GSE25518 included in the present study ( Table 1)   Meta-analysis using three different analyses, i.e., ExAtlas, Network Analyst, and GEO2R, resulted in overlapping outputs which were then used to study protein-protein interaction (PPI) and protein enrichment analysis using the STRING and BINGO application, respectively.

Results
Four microarray datasets named GSE6872, GSE6967, GSE145467, and GSE25518 included in the present study ( Table 1) altogether consisted of 77 samples, of which, 32 were controls, and the remaining 45 were patient samples. In Figure 2, the distribution of data representing these datasets has been shown with the help of a density plot, which visualizes the distribution of data over a continuous interval or time period, and the peaks of the density plot help to display where the values are concentrated over the interval. In our case, all the curves had their peaks at the interval "0", meaning that all the values have been concentrated at "0" (Figure 2).

Figure 2.
Density plot of the four datasets for teratozoospermia and azoospermia. GSE145467, GSE25518, GSE6872, and GSE6967 have been shown in pink, green, sky, and purple colors, respectively. Figure S1 represents box plots constructed using Geo2R which shows the value plots of these four datasets. The plots demonstrated that the log2 values are normalized across all the samples of each dataset with the median line having more or less equal distribution for each dataset.

Expression of Up-and Down-Regulated Genes (i.e., DEGs)
Meta-analyses of the selected microarray datasets using ExAtlas software revealed 205 significant genes using a random-effect model, of which, 133 were down-regulated and the rest, 72, were up-regulated in the patients compared to healthy controls. Figure 3 represents clustered heatmaps of the four datasets comprising the expression of DEGs. The datasets have been clustered into two groups, namely teratozoospermia and azoospermia, depending upon the expression values of the DEGs. It is clear from Figure 3 that   Meta-analyses of the selected microarray datasets using ExAtlas software revealed 205 significant genes using a random-effect model, of which, 133 were down-regulated and the rest, 72, were up-regulated in the patients compared to healthy controls. Figure 3 represents clustered heatmaps of the four datasets comprising the expression of DEGs. The datasets have been clustered into two groups, namely teratozoospermia and azoospermia, depending upon the expression values of the DEGs. It is clear from Figure 3 that both teratozoospermia and azoospermia follow a similar pattern of gene expression. The effect value in Figure 3 refers to the log ratio of gene expression change/difference compared to the control or study-wide mean or median. and the rest, 72, were up-regulated in the patients compared to healthy controls. Figure 3 represents clustered heatmaps of the four datasets comprising the expression of DEGs. The datasets have been clustered into two groups, namely teratozoospermia and azoospermia, depending upon the expression values of the DEGs. It is clear from Figure 3 that both teratozoospermia and azoospermia follow a similar pattern of gene expression. The effect value in Figure 3 refers to the log ratio of gene expression change/difference compared to the control or study-wide mean or median. Figure 3. Heatmap of the four datasets of teratozoospermia and azoospermia showing the expression of significantly expressed differential genes as shown by R software using the complete heat package of R. Effect value refers to the change of the log ratio of gene expression compared to the control or study-wide mean or median. Teratozoospermia consisted of the datasets GSE6872 and GSE6967, whereas azoospermia consisted of the datasets GSE145467 and GSE25518.
The expression pattern across different samples has been shown with the help of Figure S2. Figure 4 represents the volcano plots of significant genes of the four datasets included in our study. A volcano plot is a type of scatterplot that shows statistical significance (p-value) versus the magnitude of change (fold change). It also allows for a quick visual identification of genes with large fold changes that are also statistically significant. The red dots in the figure represent significantly over-expressed genes, the green dots represent significantly under-expressed genes, and the grey dots represent the genes that were not differentially expressed ( Figure 4).
Network Analyst analyses discovered 1812 DEGs, of which, a total of 118 genes have been found to be common when the results of both ExAtlas and Network Analyst were compared ( Figure 5). The top 25 DEGs from the above-mentioned 118 genes have been listed in Table 2 based on their fold change (FC) values along with their Entrez ID, log-ratio combined, and FDR value. Surprisingly, among all the DEGs, the sperm autoantigenic protein 17 (SPA17) gene has been found to possess the highest fold change value (9.471) from the ExAtlas analysis. This can be considered an important observation since the same gene has been found to have the highest fold change value in the case of Network Analyst analyses. Hence, SPA17 is negatively expressed in teratozoospermia or azoospermia as it remained down-regulated in the disease conditions as compared to the control. Among these top 25 DEGs, 88% of genes (22) were down-regulated, as apparent from their log-ratio combined value, while the rest 12% (3) were up-regulated (Table 2). Therefore, downregulated genes were highly expressed as compared to the up-regulated genes. cluded in our study. A volcano plot is a type of scatterplot that shows statistical significance (p-value) versus the magnitude of change (fold change). It also allows for a quick visual identification of genes with large fold changes that are also statistically significant. The red dots in the figure represent significantly over-expressed genes, the green dots represent significantly under-expressed genes, and the grey dots represent the genes that were not differentially expressed ( Figure 4). Network Analyst analyses discovered 1812 DEGs, of which, a total of 118 genes have been found to be common when the results of both ExAtlas and Network Analyst were compared ( Figure 5). The top 25 DEGs from the above-mentioned 118 genes have been  Figure 6 shows the two-dimensional (2D) principal component analysis (PCA) of the four datasets, i.e., GSE145467, GSE25518, GSE6872, and GSE6967, two of teratozoospermia and two of azoospermia. This plot shows that similar expression profiles have clustered together. It is considered one of the most famous dimension reduction methods where the     A two-dimensional (2D) PCA plot of the four datasets, two of teratozoospermia and two of azoospermia. In this figure, the round shape represents the dataset GSE145467, the triangular shape represents the dataset GSE25518, the square shape represents the dataset GSE6872, and the "+" symbol represents the dataset GSE6967. The pink color shows the control samples while the sky color shows the patient samples.
All four GEO datasets were simultaneously analyzed using GEO2R. The expression profiles contained genes that were significantly expressed in comparison to the control. Following this, the expression profiles of all the datasets were overlapped using the Venn diagram ( Figure 7A). It has been observed that only 13 significantly over-expressed genes Figure 6. A two-dimensional (2D) PCA plot of the four datasets, two of teratozoospermia and two of azoospermia. In this figure, the round shape represents the dataset GSE145467, the triangular shape represents the dataset GSE25518, the square shape represents the dataset GSE6872, and the "+" symbol represents the dataset GSE6967. The pink color shows the control samples while the sky color shows the patient samples.
All four GEO datasets were simultaneously analyzed using GEO2R. The expression profiles contained genes that were significantly expressed in comparison to the control. Following this, the expression profiles of all the datasets were overlapped using the Venn diagram ( Figure 7A). It has been observed that only 13 significantly over-expressed genes (p < 0.05) were common in all four datasets. When these 13 over-expressed genes were compared with the DEGs from ExAtlas and Network Analyst results ( Figure 7B), only two genes, CDC90B and CCDC91, were found to be common to all the three analyses, i.e., ExAtlas, Network Analyst, and GEO2R. The two genes remain down-regulated in the patients having teratozoospermia and azoospermia, as indicated by their negative log ratio combined value. These results shifted our concern towards CCDC90B and CCDC91 genes and prompted our interest in finding the biological function of these as potential biomarkers common to teratozoospermic and azoospermic men, especially in the field of male reproductive health. With respect to Figure 7B, it should be noted that all the genes that have been considered for comparative analyses among the three different softwarebased approaches (ExAtlas, Network Analyst, and GEO2R) demonstrated a significant fold change in the patient sample compared to the control.
Genes 2022, 13, x FOR PEER REVIEW 11 of 24 (p < 0.05) were common in all four datasets. When these 13 over-expressed genes were compared with the DEGs from ExAtlas and Network Analyst results ( Figure 7B), only two genes, CDC90B and CCDC91, were found to be common to all the three analyses, i.e., Ex-Atlas, Network Analyst, and GEO2R. The two genes remain down-regulated in the patients having teratozoospermia and azoospermia, as indicated by their negative log ratio combined value. These results shifted our concern towards CCDC90B and CCDC91 genes and prompted our interest in finding the biological function of these as potential biomarkers common to teratozoospermic and azoospermic men, especially in the field of male reproductive health. With respect to Figure 7B, it should be noted that all the genes that have been considered for comparative analyses among the three different softwarebased approaches (ExAtlas, Network Analyst, and GEO2R) demonstrated a significant fold change in the patient sample compared to the control. The number of common genes obtained by GEO2R from the two teratozoospermia and two azoospermia datasets as visualized by a Venn diagram; 13 genes were found common to teratozoospermia and azoospermia among the four datasets, (B) common genes of individual analyses of the four datasets of teratozoospermia and azoospermia by three different software programs, i.e., ExAtlas, Network Analyst, and GEO2R. Only 2 genes were found common among the individual results obtained across the three analyses, CCDC90B and CCDC91. Figure 8 represents the PPI network for DEGs. Among 118 first shell interactors or the query proteins, 47 proteins with zero degrees of centrality were not considered during the construction of the network, while the remaining 71 proteins were represented by nodes with different colors. The difference in the color of the nodes is due to their varying degree of connectivity. The blue-coloured nodes represent the protein with the highest degree of connectivity, i.e., 14, while green-coloured nodes represent the proteins with the lowest degree of connectivity, i.e., 1. The transition between the green and blue colors shows the different values of degrees of connectivity between 1 and 14. The node size, on the other hand, is directly proportional to the betweenness centrality value of the particular protein (Figure 8). The number of common genes obtained by GEO2R from the two teratozoospermia and two azoospermia datasets as visualized by a Venn diagram; 13 genes were found common to teratozoospermia and azoospermia among the four datasets, (B) common genes of individual analyses of the four datasets of teratozoospermia and azoospermia by three different software programs, i.e., ExAtlas, Network Analyst, and GEO2R. Only 2 genes were found common among the individual results obtained across the three analyses, CCDC90B and CCDC91. Figure 8 represents the PPI network for DEGs. Among 118 first shell interactors or the query proteins, 47 proteins with zero degrees of centrality were not considered during the construction of the network, while the remaining 71 proteins were represented by nodes with different colors. The difference in the color of the nodes is due to their varying degree of connectivity. The blue-coloured nodes represent the protein with the highest degree of connectivity, i.e., 14, while green-coloured nodes represent the proteins with the lowest degree of connectivity, i.e., 1. The transition between the green and blue colors shows the different values of degrees of connectivity between 1 and 14. The node size, on the other hand, is directly proportional to the betweenness centrality value of the particular protein ( Figure 8).

Protein-Protein Interaction (PPI) Network
Genes 2022, 13, x FOR PEER REVIEW 12 of 24 Figure 8. The Protein-Protein Interaction (PPI) network common to teratozoospermia and azoospermia using the STRING application of Cytoscape. The node size is directly proportional to the betweenness centrality value of the particular protein while the node color is based on the degree of connectivity of the different nodes with other neighboring nodes.
A scatterplot was constructed between the betweenness centrality and closeness centrality of the different nodes to visualize the proteins having different values. The betweenness centrality of a node is a measure related to the number of shortest paths the node is involved with and the closeness centrality of a node measures its average farness to all other nodes. The scatterplot between the betweenness centrality and closeness centrality showed that PPP1R36 has the highest value of all the proteins ( Figure 9A). Another scatterplot was also constructed between the betweenness centrality and degree values of the different nodes where degree represents the number of edges linked to it. In this plot, PPP1R36 has the highest value of betweenness centrality, with a degree value of 2, while the protein AURKA has the highest degree value with a betweenness centrality value of 0.263733 ( Figure 9B). A scatterplot was constructed between the betweenness centrality and closeness centrality of the different nodes to visualize the proteins having different values. The betweenness centrality of a node is a measure related to the number of shortest paths the node is involved with and the closeness centrality of a node measures its average farness to all other nodes. The scatterplot between the betweenness centrality and closeness centrality showed that PPP1R36 has the highest value of all the proteins ( Figure 9A). Another scatterplot was also constructed between the betweenness centrality and degree values of the different nodes where degree represents the number of edges linked to it. In this plot, PPP1R36 has the highest value of betweenness centrality, with a degree value of 2, while the protein AURKA has the highest degree value with a betweenness centrality value of 0.263733 ( Figure 9B).
The 20 proteins listed in Table 3 are represented via PPIs without taking secondary interactors into consideration. Table 3 shows the top 20 query nodes, arranged in descending order of their degree of centrality, along with their respective betweenness centrality, closeness centrality, and the average shortest path length. The top 20 proteins in their descending order of degree of connectivity are aurora kinase A (AURKA), thyroid hormone receptor interactor 13 (TRIP13), polo-like kinase 4 (PLK4), disks large-associated protein 5 (DLGAP5), rac GTPase-activating protein 1 (RACGAP1), kinesin-like protein (KIF2C), denticleless E3 ubiquitin protein ligase homolog (DTL), cell division cycle associated 2 (CDCA2), karyopherin α 2 (KPNA2), transforming acidic coiled-coil containing protein 3 (TACC3), meiotic nuclear divisions 1 (MND1), ATPase family AAA domain containing 2 (ATAD2), caveolin 1 (CAV1), dead-box helicase 3 X-linked (DDX3X), eukaryotic translation initiation factor 4A2 (EIF4A2), caspase 1 (CASP1), synaptonemal complex protein 3 (SYCP3), thyroid hormone receptor interactor 12 (TRIP12), testis expressed 15 (TEX15), and serum deprivation-response protein (SDPR). AURKA has topped the list with the highest degree of connectivity (14) followed by TRIP13 and PLK4 with their degree of centrality of 13 and 12, respectively. Centrality can be roughly estimated with the help of the degree of nodes. It can act as an important parameter in a signaling network as it plays a significant role in the estimation of the importance of a node/edge in the flow of information. It also plays an important role in the exploration of drug targets. Among the top three genes having the highest degrees of centrality, TRIP13 (0.383003) has a comparatively higher betweenness centrality value than the remaining two, i.e., AURKA (0.268733) and PLK4 (0.053786). The information flow in a network system can be measured with the help of betweenness centrality. Nodes with a high betweenness centrality can influence the information flow in a biological network which might be helpful as they can act as targets for drug discovery and, hence, are very crucial for network analysis. TRIP13 (0.390977) has the highest value of closeness centrality followed by AURKA (0.379562) and PLK4 (0.348993), respectively. Closeness centrality is another measure that can estimate the rate of flow of information from a given node to another node. On the other hand, TRIP13 (2.557692) has the shortest-path length followed by AURKA (2.634615) and PLK4 (2.865385), respectively. The average shortest-path length measures the accuracy of the information or mass transport occurring on a network. The top 20 interactions from a protein-protein analysis are listed in Table 3. The 20 proteins listed in Table 3 are represented via PPIs without taking secondary interactors into consideration. Table 3 shows the top 20 query nodes, arranged in descending order of their degree of centrality, along with their respective betweenness centrality, closeness centrality, and the average shortest path length. The top 20 proteins in their descending order of degree of connectivity are aurora kinase A (AURKA), thyroid hormone receptor interactor 13 (TRIP13), polo-like kinase 4 (PLK4), disks large-associated protein 5 (DLGAP5), rac GTPase-activating protein 1 (RACGAP1), kinesin-like protein (KIF2C), denticleless E3 ubiquitin protein ligase homolog (DTL), cell division cycle associated 2 (CDCA2), karyopherin α 2 (KPNA2), transforming acidic coiled-coil containing protein 3 (TACC3), meiotic nuclear divisions 1 (MND1), ATPase family AAA domain containing 2 (ATAD2), caveolin 1 (CAV1), dead-box helicase 3 X-linked (DDX3X), eukaryotic translation initiation factor 4A2 (EIF4A2), caspase 1 (CASP1), synaptonemal complex protein 3 (SYCP3), thyroid hormone receptor interactor 12 (TRIP12), testis expressed 15 (TEX15), and serum deprivation-response protein (SDPR). AURKA has topped the list with the highest degree of connectivity (14) followed by TRIP13 and PLK4 with their degree of centrality of 13 and 12, respectively. Centrality can be roughly estimated with the help of the degree of nodes. It can act as an important parameter in a signaling network as it plays a significant role in the estimation of the importance of a node/edge in the flow of information. It also plays an important role in the exploration of drug targets. Among the top three genes having the highest degrees of centrality, TRIP13 (0.383003) has a comparatively higher betweenness centrality value than the remaining two, i.e., AURKA (0.268733)  Table 3. List of top 20 interactions common to teratozoospermia and azoospermia from a proteinprotein analysis using the STRING application of Cytoscape. The genes are arranged in descending order of their degree of centrality, along with their respective average shortest path length, betweenness centrality, closeness centrality, and the clustering coefficient.

Name
Average

Pathway Enrichment Analyses
Gene Ontology (GO) analysis was performed to find out the unique biological significance based on DEGs. In the GO functional enrichment analyses using the BINGO application of Cytoscape (Figure 10), the yellow-coloured nodes have been significantly over-represented while the white-coloured ones are supportive in function. The size of a node is directly proportional to the number of query genes that are annotated to the corresponding GO category.  Table 4 shows the top 12 GO categories based on their respective node sizes which are significantly over-represented in the present study. Among all these significantly over-represented groups, the highest node size has been recorded for cellular processes followed by organelle organization. Neighborhood connectivity has been found to be Figure 10. The enrichment network of the shared differentially expressed genes (DEGs) based on the biological process network of DEGs common to teratozoospermia and azoospermia patients using the BINGO application of Cytoscape. Large nodes represent more genes involved and the size of the node is proportional to the number of targets in the GO category. Yellow-coloured nodes are significantly over-represented while the white-coloured nodes are supportive in function. Table 4 shows the top 12 GO categories based on their respective node sizes which are significantly over-represented in the present study. Among all these significantly overrepresented groups, the highest node size has been recorded for cellular processes followed by organelle organization. Neighborhood connectivity has been found to be highest for the microtubule cytoskeleton and centrosome, followed by the cell cycle process.  EIF4A2, STEAP3, SPON1, USPL1,  HHIP, UBE2D3, STON1, AREG,  BAIAP2L1, CSRP2, TOP1MT,  SPIN1, CASP1, QSOX1, KPNA2,  SPA17, STK32B, DLGAP5, VAV3,  TOR3A, TKTL1, DGAT2, PTGIS,  METTL3, VPS13A, ATP1B2,  TEX15, PASK, PIAS2,  CLDN5CCDC80, HAND2, NUP50,  ADPRH, RHOJ, FAR1, KIF2C,  RPP25, DTL, GTF2A2, PRKAA1,  CDCA2, CUL2, GMPS, TPTE,  CYP17A1, AURKA, AP3M2,  WNT6, PPP1R2, RACGAP1,  STK36, EPB41L3, PLEK2, PLK4,  XRCC6, CAV1, SIAH1, FMO1,  RANBP9, TUBE1, MND1,  SNAP91, HNRNPM, DLC1,  MGAT4A, VPS41, TACC3, TRIP12,  TAF5, TRIP13,

Discussion
Male infertility is a reproductive condition with complex etiopathology affecting more than 20 million men worldwide [2][3][4]. In pathozoospermic men, semen abnormalities associated with impaired spermatogenesis manifest as teratozoospermia, asthenozoospermia, oligozoospermia, and azoospermia [47,48]. Azoospermia represents one of the most severe forms of infertility with as high as a 25% risk of genetic disorders [49]. The classical concept of the percentage of morphologically normal spermatozoa below the World Health Organization (WHO)-stipulated lower reference limit [50] also needs to be revisited based on the proposition to include abnormalities in spermatozoa ultrastructure. These spermatozoa abnormalities at the molecular level may explain the underlying mechanism of teratozoospermia, another important male infertility type [51,52]. Several studies have been carried out in recent times for the identification of potential genetic markers in the case of these two spermatozoa abnormalities [11,53,54]. However, a convincing molecular marker common to both teratozoospermia and azoospermia with significant prognostic value has not yet been identified. The ambiguity in identifying the exact biomarkers of these disorders is also attributed to the lack of potential drug targets to improve these infertility conditions.
In an earlier study by Han et al., teratozoospermia datasets were intensively screened to find three potential biomarkers, namely, AGBL4, FAM172A, and RUNDC3B, in the teratozoospermia patient group [11], while another study identified differentiated genes in the case of patients suffering from azoospermia [24]. In this study, most of the datasets shared 25 DEGs, suggesting that they may play a role in the pathophysiology of male infertility. A total of 8 genes (THEG, SPATA20, ROPN1L, GSTF1, TSSK1B, CABS1, ADAD1, and RIMBP3) were found to be engaged in the overall spermatogenic processes, or at specific stages of spermatogenesis out of the 25 DEGs. They hypothesized that these genes have the potential to be employed as biomarkers for the early diagnosis of nonobstructive azoospermia. However, the potential markers of a single disease cannot unveil the mystery behind the failure of the overall spermatogenic process. Therefore, we made this pioneering attempt to use a rigorous approach to discover differentially expressed genes that are common to both teratozoospermia and azoospermia, with the goal of uncovering certain markers that may play a role in gametogenesis and spermatozoa development and which can be utilized for further downstream processes of identification and subsequent drug discovery.
In the present study, we integrated four datasets, i.e., two for teratozoospermia and two for azoospermia, and successfully identified two genes, CCDC90B and CCDC91, which are commonly affected in both teratozoospermia and azoospermia. Thus, CCDC90B and CCDC91 might be suitable as common candidate biomarkers in the diagnosis and/or treatment of teratozoospermic as well as azoospermic men.
The genetic basis of pathospermia (including teratozoospermia and azoospermia) has been investigated particularly in relation to the expression of miRNAs [6]. A large number of genes have been found to be associated with pathozoospermia, such as the decrease in spermatozoa concentration; however, fewer genes were found to be associated with abnormalities in spermatozoa morphology, i.e., teratozoospermia [47]. Recent studies on the genetics of teratozoospermia have identified recurrent mutations in three specific phenotypes, macrozoospermia, globozoospermia, and multiple morphological abnormalities of the flagella (MMAF) [47]. Several teratozoospermia-associated gene mutations, including F-box only protein 43 (FBXO43) [55], armadillo repeat-containing protein 2 (ARMC2) [56], SEPTIN12 [57], and AGBL carboxypeptidase 5 (AGBL5) [58], have been identified by measuring exonic mutations in blood samples using whole exome sequencing technology. Numerous genes that contribute to various sperm abnormalities have recently been discovered through improvements in sequencing methods, particularly in whole exome sequencing (WES). A homozygous loss-of-function mutation in the zinc finger MYND-type containing 15 (ZMYND15) gene was found in recent research employing WES. It has been demonstrated that the lack of ZMYND15 produces nonobstructive azoospermia and severe oligozoospermia [59]; additionally, another research suggests that it may potentially be linked to teratozoospermia [60]. ZMYND15 has also been described as a switch for haploid gene expression. Proteins such as protein 4.1 [61], spermatogenesis associated 46 (SPATA46) [62], cysteine-rich secretory protein 2 (CRISP2) [63], spermatogenesis associated 6 (SPATA6) [64], and several other genes are believed to play significant roles in the process of spermatogenesis under normal conditions and might act as molecular markers for the clinical diagnosis of pathospermia. The abnormal expression of testicular genes and the loss or mutation of the Y chromosome during spermatogenesis may lead to abnormal sperm morphology, mainly in the expressions of AURKC, SPATA16, Dpy-19 like 2 (DPY19L2), dynein axonemal heavy chain 1 (DNAH1), etc. [65] According to Wang et al., SEPT14 is predominantly expressed in the testes and neurons [23]. It is the last gene to be identified in the SEPTIN family. Two mutations, A123T and I333T, have been found to be associated with teratozoospermic patients after characterizing the genetic effects of SEPT14 in cases with abnormal sperm parameters [23]. Spermatozoa with SEPT14 mutations showed a disruption in the ultrastructure of sperm heads as well as DNA damage. Moreover, the mutation also showed a decrease in the polymerization ability of the spermatozoa [23]. Deleted in azoospermia (DAZ) and testis-specific protein Y-linked 1 (TSPY1) genes have been found to be expressed at the pre-meiotic stage, whereas transition protein 1 (TNP1), protamine 2 (PRM2), synaptojanin 2 (SYNJ2), and zona pellucida binding protein (ZPBP) genes are expressed particularly at the post-meiotic stage [66]. In addition, eight genes, named the testicular haploid expressed gene (THEG), spermatogenesis associated 20 (SPATA20), rhophilin associated tail protein 1 like (ROPN1L), glutathione transferase 1 (GSTF1), testis-specific serine kinase 1B (TSSK1B), calcium-binding protein, spermatid associated 1 (CABS1), adenosine deaminase domain containing 1 (ADAD1), and RIMS-binding protein 3 (RIMBP3), have been found to be either involved in overall spermatogenic processes or at specific phases of spermatogenesis [24]. In azoospermic conditions, the correlation of the inflammation-associated genes with those essential for spermatogenesis revealed that the genes overlapping in inflammation and spermatogenesis might be used as potential biomarkers for azoospermia [67,68]. Thus, the present study, in relation to the individual previous studies, aims to identify SPA17, PPP1R36, AURKA, TRIP13, PLK4, CCDC90B, and CCDC91 genes as important biomarkers of both teratozoospermia-and azoospermia-associated infertility in men. However, CCDC90B and CCDC91 genes were identified as the most notable markers and they might play significant roles in the diagnosis and treatment of these two infertility conditions, paving the way to targeted therapy to cure these forms of male infertility.
There are various gene families that are located on the Y chromosome which have been linked to spermatogenic failure and can lead to both teratozoospermia and azoospermia. However, the connection between the defects in these genes and the ensuing fertility problems are not well understood. Studies in mouse models show that a large number of genes involved in both the repair and monitoring of DNA damage have distinct impacts on gametogenesis during the meiotic transition [69].
Meiotic germ cell loss adds considerably to the relatively low efficiency of human spermatogenesis, according to the findings of an in-depth study on the efficacy of spermatogenesis in humans [70,71]. Investigating the expression of certain genes in humans that are involved in meiotic chromatin dynamics has been proven to be an interesting endeavor. For example, the presence of the mismatch repair gene, muts protein homolog 4 (MSH4), in human tissues suggests that the encoded protein may play a part in human meiosis [72]. In eukaryotic cells, cdc2p (a cyclin-dependent kinase), or one of its orthologs, acts as a master regulator of both the mitotic and meiotic divisions [73]. The puf-8 gene (a pumilio-related gene) is responsible for controlling RNA stabilization and translation and encodes a 'pumilio-like RNA binding protein'. In addition to the other pumilio and FBF (PUF) proteins, the PUF-8 protein is necessary for the maintenance of viable germ cells throughout the development process [74]. Additionally, it performs a non-redundant, partly penetrant function in the testes. Primary spermatocytes that do not express PUF-8 are able to complete the prophase of meiosis I; however, they then leave meiosis, re-enter mitosis, and de-differentiate, producing tumorous germ cells. This finding suggests that PUF-8 is essential for primary spermatocytes to continue progressing along the spermatogenesis pathway after completing meiosis [75]. It has also been explained that an aberrant Y material translocation, including the sex determination region (SRY), to the X chromosome may occur during paternal meiosis. This results in the formation of the 46,XX male chromosomal complement. The presence of the SRY gene does not prevent testicular differentiation, however, spermatogenesis is absent since the long arm of the Y chromosome is missing [76]. These males experience normal sexual development, having no structural abnormalities in external genitalia, but they are more likely to suffer cryptorchidism and hypospadias.
The properties of the relevant chromosomes and the breakpoint sites have a major role in predicting the risk of meiotic imbalance. The typical frequency of paternally generated translocation imbalance at prenatal diagnosis is 12%, and many of these imbalances result in fetal mortality [77,78].
More recently, Han et al. identified the AGBL4 gene which remains significantly upregulated in the spermatozoa of teratozoospermic patients. In their study, the two datasets taken into consideration were GSE6872 and GSE6967. Three common genes were found to be differentially expressed and the expression changes of these differentially expressed genes were further validated using another dataset named GSE6968 [11]. The AGBL4 gene encodes an ATP/GTP binding protein [79] which is a metallocarboxypeptidase that principally mediates the deglutamylation of target proteins, catalyzes the deglutamylation of post-translational polyglutamate side chains in proteins (e.g., tubulin), and removes polyglutamate from the carboxyl terminus of target proteins (e.g., MYLK) [80,81]. To be the best of our knowledge, no study has so far reported the association of AGBL4 with male infertility [82], although results from the differential changes of AGBL4 gene expression proved the feasibility of this gene as a diagnostic marker of clinical teratozoospermia [11]. In contrast to the findings of Han et al. [11], in this study, seven genes have been identified as biomarkers of teratozoospermia and azoospermia, SPA17, PPP1R36, AURKA, TRIP13, PLK4, CCDC90B, and CCDC91. However, the CCDC90B and CCDC91 genes emerged as the most prominent markers common to both teratozoospermia and azoospermia as confirmed by all three analyses, i.e., Network Analyst, ExAtlas, and GEO2R. These genes remained down-regulated in the patients having teratozoospermia or azoospermia as their log ratios combined value was found to be negative after the analysis. Since these genes remained down-regulated, the production of their products, i.e., proteins, would be lower in such patients.
It is quite apparent from Figure 3 that both teratozoospermia and azoospermia share a similar expression patterns of genes, thereby providing a clear idea of some common biomarkers for the two diseases. Figure 3 also indicates that there exists a clear differentiation between the patient and control groups of each dataset in terms of the expression profiles of the genes. The probability that those above-mentioned 205 genes could be considered as significant biomarkers for both teratozoospermia and azoospermia is partly supported by this observation. It is obvious from Table 2 that the SPA17 gene has the highest fold change value from both the ExAtlas and Network Analyst, which makes the gene a strong candidate for a potential common biomarker for both male infertility conditions teratozoospermia and azoospermia. The SPA17gene encodes a protein present at the cell surface and has an N-terminus with a sequence similarity to human cAMP-dependent protein kinase A (PKA) type II α regulatory subunit (RIIa), while the C-terminus has an IQ calmodulin-binding motif. The middle portion of the protein has carbohydrate-binding motifs and plays a significant role in cell-cell adhesion. The protein was initially characterized by its involvement in the binding of spermatozoa to the zona pellucida of the oocyte [83]. Any mutations/changes in the gene would prevent the association of spermatozoa with the oocyte, resulting in the failure of fertilization and ultimately leading to infertility. More recent studies also show its involvement in additional cell-cell adhesion functions such as immune cell migration and metastasis [84]. Additionally, it plays an important role in cell regulation by participating in signaling pathways through its calmodulin-binding site at the C-terminal [85]. Since SPA17 is down-regulated, as evident from its log ratio combined value, its expression would decrease in teratozoospermic and azoospermic patients. In addition, the positive fold change value of 9.471 shows that its negative expression increases 9.471 times. Moreover, SPA17 is involved in the cellular process pathway, as shown in Table 4, so any change in its expression will lead to an alteration of the cellular process as well. It is evident from Figure 9A that the PPP1R36 gene has the highest value of betweenness and closeness centrality, making it an important biomarker for both teratozoospermia and azoospermia. PPP1R36 is highly expressed in testes compared to other tissues. It is expressed during gonadal development, especially in testes during spermatogenesis. PPP1R36 is not only expressed in the developing testes during spermatogenesis but is also present in the acrosome of mature spermatozoa, indicating a role of PPP1R36 in sperm activity, probably through autophagy [86].
In the case of PPI analyses, the top three genes participating in the network were AURKA, TRIP13, and PLK4, based on their degree of centrality. The protein encoded by the AURKA gene is a cell cycle-regulated kinase that is involved in microtubule formation and/or stabilization at the spindle pole during chromosome segregation [87]. The encoded protein is found at the centrosome in interphase cells and the spindle poles in mitosis. This gene might play an important role in tumor development and progression [88]. TRIP13 encodes a protein named thyroid receptor-interacting protein 13 which plays a key role in chromosome recombination and chromosome structure development during meiosis [89]. It is also required at early steps in the meiotic recombination that lead to non-crossovers pathways. Moreover, the protein also aids in the efficient completion of homologous synapsis by influencing crossover distribution along the chromosomes, affecting both crossovers and non-crossovers pathways [90]. More importantly, the protein is required for the efficient synapsis of the sex chromosomes and sex body formation [91]. The PLK4 gene encodes a member of the polo family of serine/threonine protein kinases [92]. The protein localizes to centrioles, complex microtubule-based structures found in centrosomes, and regulates centriole duplication during the cell cycle [93]. CCDC90B (Coiled-Coil Domain Containing 90B) is a protein-coding gene. Diseases associated with CCDC90B include oculoauricular syndrome and osteochondrosis [94]. CCDC91 (Coiled-Coil Domain Containing 91), a protein-coding gene, is associated with diseases such as ossification of the posterior longitudinal ligament of the spine and diffuse idiopathic skeletal hyperostosis (https://www.genecards.org/cgi-bin/carddisp.pl?gene=CCDC91 accessed on 2 January 2022).

Conclusions
In conclusion, the present study has identified 118 DEGs common to the four profile datasets (two belonging to both of teratozoospermia and azoospermia) based on ExAtlas and Network Analyst results. A number of DEGs have been found to be common to both teratozoospermia and azoospermia and may have a diagnostic role in both clinical conditions that may lead to infertility. Among all the DEGs, the significant genes are SPA17, CCDC90B, and CCDC91. The 118 DEGs, after comparison with GEO2R software, showed only two genes, CCDC90B and CCDC91, to be common in the three analyses, i.e., ExAtlas, Network Analyst, and GEO2R. Therefore, it can be said that CCDC90B and CCDC91 genes could be the potential common biomarker candidates in the pathospermic conditions of both teratozoospermia and azoospermia.
The significantly enriched pathways based on the above-mentioned genes are mainly focused on cell cycle and development processes. These observations could significantly improve our understanding of the causes and underlying molecular mechanisms in teratozoospermia and azoospermia. However, further in vivo analysis of these markers is needed to prove their potentiality and establish their effectiveness as potential drug targets.

Conflicts of Interest:
The authors declare no conflict of interest.