Identifying Key Regulators of Keratinization in Lung Squamous Cell Cancer Using Integrated TCGA Analysis

Simple Summary Keratins are the intermediate filament-forming proteins of epithelial cells that support cell structure and tissue homeostasis. In lung squamous cell cancer, keratins are frequently used as diagnostic tumor markers. However, several studies have shown evidence of keratin’s roles in cancer invasion, metastasis, and therapy resistance. In this study, we aimed to identify key regulators of the cancer-related keratinization process in LUSC. Our findings may help others to gain insight into the cancer-related keratinization process and to find potential targets for diagnostics and therapy for LUSC. Abstract Keratinization is one of lung squamous cell cancer’s (LUSC) hallmark histopathology features. Epithelial cells produce keratin to protect their integrity from external harmful substances. In addition to their roles as cell protectors, recent studies have shown that keratins have important roles in regulating either normal cell or tumor cell functions. The objective of this study is to identify the genes and microRNAs (miRNAs) that act as key regulators of the keratinization process in LUSC. To address this goal, we classified LUSC samples from GDC-TCGA databases based on their keratinization molecular signatures. Then, we performed differential analyses of genes, methylation, and miRNA expression between high keratinization and low keratinization samples. By reconstruction and analysis of the differentially expressed genes (DEGs) network, we found that TP63 and SOX2 were the hub genes that were highly connected to other genes and displayed significant correlations with several keratin genes. Methylation analysis showed that the P63, P73, and P53 DNA-binding motif sites were significantly enriched for differentially methylated probes. We identified SNAI2, GRHL3, TP63, ZNF750, and FOXE1 as the top transcription factors associated with these binding sites. Finally, we identified 12 miRNAs that influence the keratinization process by using miRNA–mRNA correlation analysis.


Introduction
Lung squamous cell carcinoma (LUSC) accounts for 20% of all lung cancer diagnoses and is the second most common subtype of lung cancer [1]. One of the key histological features of LUSC is keratinization [2]. When epithelial cells are keratinized to form a keratin layer, a unique program of terminal differentiation and apoptotic cell death follows [3]. Keratinization is an example of epithelial cell adaptation to protect cell integrity from environmental influences, such as physical damage, infection, or xenobiotic substances [4]. In addition to their role as epithelial cell protectors, keratins have other important roles as cellular function regulators, such as apical-basal polarization, cell size determination, protein translation, and organelle position regulation [5].
The overall pipeline in our study is depicted in Figure 1. First, we downloaded and analyzed the messenger RNA (mRNA), methylation, and microRNA (miRNA) profiles from the GDC-TCGA LUSC database. We utilized the Singscore individual sample gene set scoring approach [12] for classifying the samples into three groups (low, medium, and high keratinization) based on their keratinization signature scores. Then, we performed differential expression analyses (DEA) of the transcriptomic, methylomic, and miRNA profiles of these groups. Finally, we identified the key regulatory genes, transcription factors (TFs), and miRNAs using network analysis [9,13], the ELMER algorithm [14], and miRNA-mRNA correlation analysis [11], respectively.

Data Acquisition and Preparation
The gene expression, methylation beta value, and miRNA isoform data were obtained from the GDC-TCGA harmonized database using the Bioconductor package TCGAbiolinks [15]. The gene expression data consisted of 502 LUSC primary tumor samples and 49 normal tissue samples. Then, we paired the samples from the methylation and miRNA expression data with their corresponding samples from the gene expression data. We removed the data that have no corresponding paired samples in gene expression data. Finally, we only kept the primary tumor samples' methylation and miRNA expression data.
Next, we filtered non-relevant data from gene, methylation, and miRNA data for further analysis. The genes with low counts across most samples were discarded from the analysis. Here, we retained genes that had a transcript count per million >1 across over 50% of the samples. We discarded the genes that had identical gene names to enforce unique mapping. For methylation data, we removed the probes with at least one missing value and removed the probes in chromosomes X, Y, and NA. We used only mature miRNA expression data for miRNA analysis. Overall pipeline of the analysis in our study. We divided the LUSC samples using singlesample gene set scoring, Singscore, with keratinization-related gene sets. Then, we performed differential expression analyses (DEA) of gene, methylation, and miRNA expression between high keratinization and low keratinization groups. Differentially expressed gene networks, ELMER, and mirTarRNASeq analysis were performed to identify key regulatory genes, transcription factors, and miRNA.

Data Acquisition and Preparation
The gene expression, methylation beta value, and miRNA isoform data were obtained from the GDC-TCGA harmonized database using the Bioconductor package TCGAbiolinks [15]. The gene expression data consisted of 502 LUSC primary tumor samples and 49 normal tissue samples. Then, we paired the samples from the methylation and miRNA expression data with their corresponding samples from the gene expression data. We removed the data that have no corresponding paired samples in gene expression data. Finally, we only kept the primary tumor samples' methylation and miRNA expression data.
Next, we filtered non-relevant data from gene, methylation, and miRNA data for further analysis. The genes with low counts across most samples were discarded from the analysis. Here, we retained genes that had a transcript count per million >1 across over 50% of the samples. We discarded the genes that had identical gene names to enforce unique mapping. For methylation data, we removed the probes with at least one missing value and removed the probes in chromosomes X, Y, and NA. We used only mature miRNA expression data for miRNA analysis.

Data Single-Sample Scoring and Clustering
We downloaded the Gene Ontology (GO) gene sets from the GSEA database (https://www.gsea-msigdb.org accessed on 6 November 2022) and selected the gene sets related to the keratinization process or keratinocyte differentiation/development. The gene sets and their brief descriptions are listed in Table 1.
We used the Singscore method to score individual samples against these 10 keratinization-related gene sets [12]. In Singscore methods, genes are sorted according to their transcript levels, with upregulated sets being ranked in ascending order and Overall pipeline of the analysis in our study. We divided the LUSC samples using singlesample gene set scoring, Singscore, with keratinization-related gene sets. Then, we performed differential expression analyses (DEA) of gene, methylation, and miRNA expression between high keratinization and low keratinization groups. Differentially expressed gene networks, ELMER, and mirTarRNASeq analysis were performed to identify key regulatory genes, transcription factors, and miRNA.

Data Single-Sample Scoring and Clustering
We downloaded the Gene Ontology (GO) gene sets from the GSEA database (https: //www.gsea-msigdb.org accessed on 6 November 2022) and selected the gene sets related to the keratinization process or keratinocyte differentiation/development. The gene sets and their brief descriptions are listed in Table 1.

Biological Process KERATINOCYTE DEVELOPMENT
The process whose specific outcome is the progression of a keratinocyte over time, from its formation to the mature structure.

Biological Process KERATINOCYTE DIFFERENTIATION
The process in which a relatively unspecialized cell acquires specialized features of a keratinocyte. We used the Singscore method to score individual samples against these 10 keratinizationrelated gene sets [12]. In Singscore methods, genes are sorted according to their transcript levels, with upregulated sets being ranked in ascending order and downregulated sets in descending order. Subsequently, rank-based statistics are utilized to determine the scores for each individual sample [12]. Hierarchical clustering with complete linkage was performed to divide the samples into subgroups with different degrees of keratinization according to the keratinization signature scores. The optimal number of clusters was determined using the R package NbClust, which provides 30 indices that determine the number of clusters.

Differential Expression Analysis (DEA) of the Genes
We performed DEA of the tumor samples in high keratinization versus low keratinization groups for identifying key regulators of the keratinization process in LUSC. We preprocessed the gene expression data using the TCGAbiolinks package and workflow from Silva et al. [16]. In short, we removed outliers, failed hybridization, or mistracked samples by performing Array-Array Intensity Correlation using the TCGAanalyze_Preprocessing function. Next, we normalized mRNA transcript samples using the TCGAanalyze_Normalization, which encompasses the functions of the EDASeq package. Finally, we filtered genes with low signals across samples using TCGAanalyze_Filtering functions. The function TCGAan-alyze_DEA was applied to identify the differentially expressed genes (DEG) between high keratinization and low keratinization groups. We defined the genes with the absolute log fold change ≥1 and FDR < 0.01 as the significant DEGs.

DEGs Network Reconstruction and Analysis
We modified the workflow from our previous work to perform DEGs network reconstruction and analysis [9]. In summary, we performed log(1 + x) transformation to the significant DEG's expression matrix before we inputted it into the network reconstruction algorithm. We used the GRNBoost2 algorithm implemented in the Python package Arboreto to reconstruct the DEG regulatory network [17]. Briefly, this algorithm involves training a gradient boosting machine model for each differentially expressed gene in the dataset to predict its expression profile using the expression values of a set of candidate transcription factors (TFs). Each model produces a partial gene regulation network with regulatory associations from the best-predicting TFs to the target gene. All regulatory associations are combined and sorted by importance to finalize the regulatory network output.
Community detection of the network was performed using the Leiden algorithm implemented in the Python package leidenalg [18]. The community Reactome pathway-based analysis was performed using g:Profiler [19]. Then, we calculated the node betweenness centrality and degree using the python-igraph package. We performed Pearson's correlation test between the expression of the genes with high centrality or high out-degree index and the keratin genes. We adjusted the p-value using Bonferroni correction and considered an adjusted p-value < 0.05 as significant.

Methylation Motif and Regulatory Transcription Factor Identification
First, we compared the mean DNA methylation levels across the three groups. We performed ANOVA test to determine if there was any difference between the means of different groups. We considered ANOVA test p-value < 0.05 as significant.
Next, we used ELMER analysis workflow to identify the enriched motifs for the probes that were significantly differentially hypomethylated in the high keratinization samples relative to low keratinization samples [10]. ELMER workflow consists of 5 main steps: (1) identifying distal probes on Infinium Human Methylation 450K arrays, (2) selecting significantly hypomethylated probes by comparing the methylation level of each probe for all samples in group 1 compared to all samples in group 2, using an unpaired onetailed t-test, (3) identifying putative target genes for differentially methylated probes by performing Mann-Whitney U test for each candidate probe-gene pair, (4) identifying enriched motifs of hypomethylated probes using Fischer's exact test, and (5) identifying regulatory transcription factors (TFs) whose expression is associated with TF binding motif DNA methylation by performing Mann-Whitney U test for each candidate motif-TF pair. A motif was considered significantly enriched if the 95% confidence interval of the odds ratio was greater than 1.1. The regulatory TFs were ranked by their p-value and those in the top 5% of the smallest p-value were considered candidates for upstream regulators.

miRNA-mRNA Relationships Analysis
We used mature miRNA count data to perform DEA of miRNA by TCGAbiolinks package. We filtered the miRNAs with low signals across samples using TCGAanalyze_Filtering functions. The function TCGAanalyze_DEA was applied to identify the differentially expressed miRNA between high keratinization and low keratinization groups. The absolute log fold change ≥1 and FDR < 0.01 were used as the thresholds to select significantly differentially expressed miRNAs. We inputted the log fold change of significant differentially expressed genes from previous DEG analysis and miRNAs to the R Bioconductor package mirTarRnaSeq to identify some significant miRNA-mRNA correlations [11]. The mirTarRnaSeq approach estimated the difference between the miRNA-mRNA fold change, followed by the generation of a background distribution that represents random differences in fold chance. Then, it ranked the difference values against the background distribution to obtain the p-value, FDR, and test statistics estimates. Finally, we intersected our significant miRNA-mRNA correlations result with miRanda database binding predictions [20].

Source Code
The complete source code and the parameter details needed to reproduce our study have been stored in the public repository (https://github.com/yusri-dh/Keratinization_ LUSC accessed on 19 December 2022).

Hierarchical Clustering Based on Single-Sample Scoring against Keratinization-Related Gene Set Identifies Three LUSC Phenotypes
Based on the single-sample scores against the 10 keratinization-related gene sets, we hierarchically clustered LUSC samples into three groups: high keratinization, medium keratinization, and low keratinization ( Figure 2). We selected K = 3 as the optimal number of clusters based on the result of the NbClust analysis. The high keratinization group was characterized by the enrichment in cornification and keratinization gene sets. In contrast, low keratinization samples have an inverse enrichment in cornification and keratinization gene sets. They also have a lower expression of genes related to keratinocyte differentiation and the apoptotic process than samples in the high keratinization group. All normal samples were in the low keratinization group. We compared the samples in high and low keratinization samples in subsequent analyses for identifying key regulators of the keratinization process in LUSC.

DEGs Network Reconstruction and Analysis
We identified 1287 significant DEGs (FDR-adjusted p < 0.01) by comparing the mRNA expression profiles between the high keratinization and low keratinization groups (Table S1). To gain a deeper understanding of the functions and interactions of DEGs, we used the GRNBoost2 network reconstruction implemented in the Arboreto package [17]. The inferred network had 1097 nodes and 9145 edges (Figure 3a). The nodes' properties and edge list are provided in Tables S2 and S3, respectively. Using the Leiden algorithm, we identified 10 major communities in the keratinization DEGs network. We named the largest community Community 0, the second largest Community 1, and so on. The Reactome pathway-based enrichment analysis showed that Community 0 and 3 mainly included keratinization-related genes. Community 1 and 2 networks were enriched in biological oxidation and detoxification-related genes; Community 4 contained immune system genes; and Community 5 contained genes that were responsible for surfactant metabolism (Figure 3b). From the centrality analysis, we found that SOX2, FOXE1, and SYNE1 were the genes with the highest betweenness centrality (Table 2). Meanwhile, TP63, CD53, and NCKAP1L were the genes with the highest node out-degree index ( Table 2). We selected the genes with the highest betweenness centrality and out-degree index to perform Pearson's correlation test with keratin genes. The TP63 genes had a strong and significant positive correlation (Pearson's R > 0.5, adjusted p < 0.05) with several keratin genes: KRT5, KRT6A, KRT6B, KRT13, KRT14, KRT15, KRT16, and KRT17 ( Figure S1a). Meanwhile, the SOX2 gene had a medium and significant positive correlation (Pearson's R > 0.3, adjusted p < 0.05) with KRT5, KRT6A, KRT6B, KRT13, KRT15, and KRT19 ( Figure S1b).
The complete source code and the parameter details needed to reproduce our study have been stored in the public repository (https://github.com/yusri-dh/Keratiniza-tion_LUSC accessed on 19 December 2022).

P63, P73, and P53 Were the Top Three Enriched Motifs for Hypomethylated Probes
Through ANOVA analysis, we found no significant differences in overall DNA methylation levels across the three groups (Figure 4a). Then, using an ELMER analysis, we found 284 gene-probe pairs that were significantly hypomethylated in high keratinization samples with FDR-adjusted p < 0.001. Figure 4b displays the methylation levels of the identified significant probes, along with the expression patterns of the corresponding gene pairs. A list of all significant gene-probe pairs and their p-values is given in Table S4. From all significant pairs, we could identify enriched motifs for the probes that are significantly hypomethylated in high keratinization samples relative to low keratinization samples (Figure 4c). Then, ELMER identified the master regulator TFs corresponding to each of the motifs enriched in the previous analysis step. From all enriched motifs, we could only obtain the significant regulatory TFs associated with the top three binding motifs: the P63, P73, and P53 motifs. The top 5% of TFs associated with the binding motifs are listed in Table 2.

The miRNA-mRNA Relationships Analysis
The mirTarRnaSeq results showed that there were 913 miRNA-mRNA pairs that had a significant relationship (FDR-adjusted p < 0.05). From these 913 miRNA-mRNA pairs, there were 78 pairs that intersected with the miRanda database findings. Figure 5 shows these miRNA-gene pairs and their relationship strength in the unit of absolute log fold difference. Out of all the miRNAs, hsa-miR-375-3p and hsa-miR-9-5p had the greatest number of pairs (16 and 13 pairs, respectively). Whereas, NRTK2 and SERPINB13 were the genes that were affected by the highest number of miRNAs. The regulatory miRNAs are listed in Table 2.

The miRNA-mRNA Relationships Analysis
The mirTarRnaSeq results showed that there were 913 miRNA-mRNA pairs that had a significant relationship (FDR-adjusted p < 0.05). From these 913 miRNA-mRNA pairs, there were 78 pairs that intersected with the miRanda database findings. Figure 5 shows these miRNA-gene pairs and their relationship strength in the unit of absolute log fold difference. Out of all the miRNAs, hsa-miR-375-3p and hsa-miR-9-5p had the greatest number of pairs (16 and 13 pairs, respectively). Whereas, NRTK2 and SERPINB13 were the genes that were affected by the highest number of miRNAs. The regulatory miRNAs are listed in Table 2.

Discussion
Several methods for scoring individual samples against gene sets have been developed, including ssGSEA (single-sample gene set enrichment analysis) [21], GSVA (gene set variation analysis) [22], and Singscore [12]. These frameworks have been used for dimensional reduction, clustering, and condensing information from transcriptomic data [23][24][25]. In this study, we used Singscore to classify LUSC samples into three groups (i.e., low, medium, and high keratinization) based on their keratinization signature scores. The high keratinization group demonstrated the upregulation of genes involved in the cornification, keratinization, and keratinocyte apoptotic processes, whereas these gene sets were underrepresented in the low keratinization group. Using differential analyses of high and low keratinization groups, we investigated the genes, transcription factors, and miRNA that act as key regulators of the LUSC keratinization process.
To identify keratinization core regulatory genes, we employed the same network analysis approach utilized in our prior research [9]. In this approach, we inferred and reconstructed the DEGs network, which provided us with a blueprint of the gene-gene interactions in cancer. Here, we calculated the nodes' out-degree index and betweenness centrality to investigate the roles of some nodes and their impact on the networks. The out-degree index of a node is the number of edges that are going out from the node. Narang et al. showed that the important TFs have a high out-degree index, implying that TFs usually target a large number of genes [13]. In our study, TP63 had the highest out-degree index. TP63 is highly expressed in the basal compartment of the lung airways and is required for progenitor cell development, epidermal stratification, and to maintain the proliferative potential of basal keratinocytes during homeostasis [26][27][28]. In the gene regulation network, the most important nodes are not always the ones with the most edges, but the ones that connect groups or have the most control over the flow of information. The

Discussion
Several methods for scoring individual samples against gene sets have been developed, including ssGSEA (single-sample gene set enrichment analysis) [21], GSVA (gene set variation analysis) [22], and Singscore [12]. These frameworks have been used for dimensional reduction, clustering, and condensing information from transcriptomic data [23][24][25]. In this study, we used Singscore to classify LUSC samples into three groups (i.e., low, medium, and high keratinization) based on their keratinization signature scores. The high keratinization group demonstrated the upregulation of genes involved in the cornification, keratinization, and keratinocyte apoptotic processes, whereas these gene sets were underrepresented in the low keratinization group. Using differential analyses of high and low keratinization groups, we investigated the genes, transcription factors, and miRNA that act as key regulators of the LUSC keratinization process.
To identify keratinization core regulatory genes, we employed the same network analysis approach utilized in our prior research [9]. In this approach, we inferred and reconstructed the DEGs network, which provided us with a blueprint of the gene-gene interactions in cancer. Here, we calculated the nodes' out-degree index and betweenness centrality to investigate the roles of some nodes and their impact on the networks. The out-degree index of a node is the number of edges that are going out from the node. Narang et al. showed that the important TFs have a high out-degree index, implying that TFs usually target a large number of genes [13]. In our study, TP63 had the highest out-degree index. TP63 is highly expressed in the basal compartment of the lung airways and is required for progenitor cell development, epidermal stratification, and to maintain the proliferative potential of basal keratinocytes during homeostasis [26][27][28]. In the gene regulation network, the most important nodes are not always the ones with the most edges, but the ones that connect groups or have the most control over the flow of information. The betweenness centrality measures the number of times a node lies on the shortest path between other nodes. This measure shows which nodes act as bridges between nodes and have significant control over the information flow in the gene regulatory network [9,13]. We found SOX2 to be the gene with the highest betweenness centrality. Moreover, it is also in the top six highest out-degree genes, indicating that SOX2 has important roles in the LUSC keratinization process. SOX2 is a transcription factor that promotes the development and maintenance of squamous epithelium and is an essential regulator of pluripotent stem cells [29]. Because of their adjacent chromosomal localization (3q), SOX2 and TP63 are frequently co-amplified in cancer [29]. TP63 and SOX2 collaborate to regulate multiple genes involved in squamous carcinogenesis [29]. Our study found that TP63 and SOX2 had significant medium and strong correlations with several keratin genes, such as KRT5, KRT6A, KRT6B, KRT13, KRT14, KRT15, and KRT17. This finding indicated that TP63 and SOX2 have a direct influence on the keratinization process by modulating keratinization genes in LUSC.
The community gene set enrichment analysis of the DEGs network can reveal the important gene interactions and processes that influence keratinization. We found that the genes related to detoxification (i.e., biological oxidations and metallothioneins binding to metals), the immune system, and surfactant metabolism were connected with the keratinization process in LUSC. Lung epithelial cells are one of the first lines of defense that interact directly with potentially harmful substances. Smoking, as the primary LUSC risk factor, may be responsible for the detoxification-keratinization relationship [8]. Some studies have highlighted the roles of keratins as regulators of inflammation and immunity in epithelia [30][31][32][33]. The keratin 76 downregulation enhanced the accumulation of T regulatory cells, leading to a drop in the anti-tumor response in oral squamous cell carcinoma [33]. In basal cell carcinoma, the genetic ablation of keratin 17 decreased inflammation and polarized the inflammatory response towards T-helper-2-cells. However, the detailed mechanism of how keratinization and the immune system are interrelated in LUSC remains poorly understood. To the authors' knowledge, no papers have investigated the direct relationship between keratinization and surfactant. In the lung, the surfactant has important roles as a regulator of the immune system and homeostasis [34,35]. Further study is necessary to elucidate the relationship between keratinization and surfactant metabolism.
In addition to genetic changes, many DNA methylation alterations are associated with the LUSC pathophysiology process. We found that the P63, P73, and P53 binding motifs were enriched for hypomethylated probes in high keratinization samples. P73 and P63 are two homologs of P53 and have a high degree of structural symmetry with P53. Thus, P73 and P63 can bind to most of the p53-responsive binding sites, and vice versa [36]. SNAI2, GRHL3, TP63, ZNF750, and FOXE1 were the top TFs associated with the P63, P73, and P53 binding motifs in our study. These transcription factors have important roles in epithelial differentiation and keratinization [37][38][39][40][41].
MicroRNAs, or miRNAs, are small non-coding RNAs that play important roles in post-transcriptional gene regulation [42,43]. These miRNAs may directly regulate keratin proteins. For example, hsa-miR-3074-5p, which targets KRT13 and KRT6B, or hsa-miR-9-5p, which targets KRT6C, KRT13, and KRT5. Other miRNAs regulate keratinization-related processes, such as hsa-miR-149-5p, hsa-miR-192-5p, and hsa-miR-378a-3p, which target SERPINB13, or hsa-miR-3074-5p, which target KRTDAP. Downregulation of the SERPINB13 protein was significantly associated with keratinocyte/epithelial cell differentiation and apoptosis [44]. Meanwhile, KRTDAP is a gene that plays important roles in the epithelium stratification process [45]. In our analysis, we found that hsa-miR-375-3p and hsa-miR-9-5p had the greatest number of pairs with other DEGs. This suggests that the regulation of the LUSC keratinization process is vastly influenced by hsa-miR-375-3p and hsa-miR-9-5p. The expression of hsa-miR-375-3p has been shown to have potential as a promising diagnostic marker in oral cancer [46], breast cancer [47], and head and neck cancer [48]. Additionally, a study showed that hsa-miR-375-3p may have a suppressor role in bladder cancer via the Wnt/beta-catenin pathway [49]. In the case of hsa-miR-9-5p, some studies have demonstrated that hsa-miR-9-5p has potential as a biomarker of therapy response in nasopharyngeal carcinoma [50], cervical cancer [51], and leukemia [52]. To the best of the authors' knowledge, there have been no experimental studies conducted to explore the involvement of hsa-miR-375-3p and hsa-miR-9-5p in the keratinization process or to evaluate their potentials as diagnostic biomarkers, making it an interesting subject for future studies.

Conclusions
The single-sample scoring approach was used in our study to classify the sample into three groups based on their keratinization signatures. Through a comparison of high keratinization versus low keratinization samples using DEA, we were able to identify several transcription factors (TFs), binding motifs, and miRNAs that are likely involved in regulating the keratinization process of lung squamous cell carcinoma (LUSC). Specifically, we emphasized the importance of the TP63 and SOX2 genes, the P63, P73, and P53 DNA binding motifs, and the hsa-miR-375-3p and hsa-miR-9-5p miRNAs as potential key regulators of keratinization in LUSC. Further studies of these regulators might help researchers acquire a deeper understanding of the keratinization process in LUSC and find potential novel biomarkers.