1. Introduction
Primary open-angle glaucoma (POAG) is a chronic, progressive optic neuropathy and represents the leading cause of irreversible blindness worldwide [1]. It is characterized by the gradual loss of retinal ganglion cells and optic nerve damage, often accompanied by elevated intraocular pressure (IOP), although disease progression may continue despite effective IOP control [2]. Epidemiological studies estimate that the global number of individuals affected by glaucoma will exceed 110 million by 2040, highlighting its growing public health burden [3].
Accumulating evidence indicates that POAG is a multifactorial disease involving oxidative stress, extracellular matrix remodeling of the trabecular meshwork, immune dysregulation, and metabolic abnormalities rather than a purely mechanical disorder driven by IOP elevation alone [4,5]. In particular, immune-mediated neuroinflammation has emerged as a key contributor to optic nerve degeneration in glaucoma, with both innate and adaptive immune responses implicated in the progression of the disease [6,7].
Recent advances in high-throughput transcriptomic profiling have enabled systematic characterization of molecular alterations in glaucomatous tissues [8]. Bulk transcriptomic studies have revealed extensive gene expression remodeling in the trabecular meshwork and optic nerve head, providing insights into disease-associated pathways [9]. However, bulk analyses are inherently limited by cellular heterogeneity and cannot fully resolve cell type-specific regulatory programs. In this context, single-cell RNA sequencing (scRNA-seq) has emerged as a powerful approach to overcome these limitations by revealing previously unrecognized cellular diversity and distinct transcriptional states, thereby providing a higher-resolution framework for elucidating the molecular mechanisms underlying glaucoma [10,11].
In parallel, the importance of metabolic regulation in disease-associated transcriptional control is increasingly appreciated. At the epigenetic level, histone lysine lactylation, a recently identified post-translational modification derived from cellular metabolism, provides a mechanistic link between metabolic state and gene regulation [12]. Although lactylation has been implicated in immune regulation and disease pathogenesis in multiple systems, its potential involvement in glaucoma-related transcriptional networks remains largely unexplored [13].
Machine learning approaches have increasingly been applied to ophthalmic research and have demonstrated strong potential for disease classification and biomarker discovery in glaucoma [14]. By integrating transcriptomic features with advanced computational modeling, machine learning provides a powerful framework for identifying candidate diagnostic markers from high-dimensional biological data.
In the present study, we applied an integrative analytical framework combining bulk transcriptomics, weighted gene co-expression network analysis, immune deconvolution, machine learning-based classification modeling, and single-cell contextual analysis to systematically investigate lactylation-associated molecular signatures and their diagnostic relevance in POAG. This multi-layered approach aimed to identify potential diagnostic biomarkers and to provide a more comprehensive view of the immune and metabolic dysregulation underlying POAG pathogenesis.
Given the limited understanding of lactylation-associated regulatory mechanisms in glaucoma and the need for reliable molecular biomarkers, integrating lactylation-associated genes with diagnostic modeling and validation at cellular resolution facilitates candidate gene prioritization and provides a basis for hypothesis generation in glaucoma research.
2. Materials and Methods
2.1. Data Download
Bulk transcriptomic datasets were downloaded from the Gene Expression Omnibus (GEO) database using the GEOquery R package (version 2.78.0). Expression matrices and corresponding sample metadata were retrieved for downstream analysis.
Log2 transformation was applied when required based on quantile distribution inspection to ensure approximate normality. Probes were mapped to gene symbols using platform annotation (GPL) files. Probes without valid gene annotations were removed. When multiple probes corresponded to the same gene symbol, the probe with the highest mean expression across samples was retained.
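The probe-to-gene collapsing rule described above can be illustrated with a minimal Python sketch. This is not the code used in the study (which was carried out in R); the function name `collapse_probes`, the probe identifiers, the gene symbols, and the toy expression values are all hypothetical and serve only to show the "highest mean expression" rule.

```python
from statistics import mean

def collapse_probes(expr, probe_to_gene):
    """Keep, for each gene symbol, the probe with the highest mean expression.

    expr: dict mapping probe ID -> list of expression values (one per sample).
    probe_to_gene: dict mapping probe ID -> gene symbol ("" or None if unannotated).
    Returns a dict mapping gene symbol -> values of the retained probe.
    """
    best = {}  # gene symbol -> (mean expression, probe ID)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if not gene:  # drop probes without a valid gene annotation
            continue
        m = mean(values)
        if gene not in best or m > best[gene][0]:
            best[gene] = (m, probe)
    return {gene: expr[probe] for gene, (_, probe) in best.items()}

# Hypothetical toy data: two probes for one gene and one unannotated probe.
expr = {
    "p1": [5.0, 6.0],  # mean 5.5
    "p2": [7.0, 8.0],  # mean 7.5, same gene as p1, so this probe is retained
    "p3": [2.0, 2.0],  # no annotation, removed
}
annotation = {"p1": "GENE_A", "p2": "GENE_A", "p3": ""}
collapsed = collapse_probes(expr, annotation)
```

Retaining a single probe per gene, rather than averaging probes, keeps each gene-level value traceable to one measured probe.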
Sample grouping (POAG vs. Control) was defined according to metadata fields in the pData table and standardized prior to integration with the expression matrix. Samples and expression profiles were matched to ensure consistency before downstream analysis.
2.2. Differential Analysis
The limma R package (version 3.66.0) was used to identify differentially expressed genes (DEGs) between POAG and control samples, with empirical Bayes moderation of standard errors. DEGs were defined using a significance threshold of p < 0.05 and |logFC| > 1.
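As an illustration of the DEG definition only, the following Python sketch applies the two thresholds to a table of per-gene statistics. It does not reproduce the limma model fit or the empirical Bayes moderation; the function name `select_degs`, the gene labels, and the numbers are hypothetical.

```python
def select_degs(results, p_cutoff=0.05, lfc_cutoff=1.0):
    """Return genes passing both thresholds, labelled by direction of change.

    results: list of dicts with keys "gene", "logFC", and "p".
    """
    degs = {}
    for row in results:
        if row["p"] < p_cutoff and abs(row["logFC"]) > lfc_cutoff:
            degs[row["gene"]] = "up" if row["logFC"] > 0 else "down"
    return degs

# Hypothetical per-gene statistics (not taken from the study).
results = [
    {"gene": "GENE_A", "logFC": 1.8, "p": 0.001},    # passes both thresholds
    {"gene": "GENE_B", "logFC": 0.4, "p": 0.0005},   # fold change too small
    {"gene": "GENE_C", "logFC": -1.3, "p": 0.02},    # passes, downregulated
    {"gene": "GENE_D", "logFC": 2.1, "p": 0.2},      # not significant
]
degs = select_degs(results)
```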
2.3. Single-Cell RNA Sequencing Data Processing
Single-cell RNA sequencing data were processed using the Seurat R package (version 5.4.0). Raw count matrices were imported using the CreateSeuratObject function with filtering thresholds of min.cells = 3 and min.features = 200. The H9RimS1 sample was excluded before quality control procedures.
For each cell, the following quality control metrics were calculated: total UMI counts (nCount_RNA), number of detected genes (nFeature_RNA), percentage of mitochondrial genes (percent.mt), and percentage of ribosomal genes (percent.ribo). Cells deviating more than three median absolute deviations (MADs) from the median for any metric were removed. Cells with abnormally high UMI counts or gene numbers were excluded to reduce potential doublets. Cells with elevated mitochondrial or ribosomal gene percentages were removed to eliminate low-quality or stressed cells.
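The MAD-based filtering rule can be sketched in a few lines of Python. This is an illustrative sketch rather than the study's R code, and it assumes an unscaled MAD (no consistency constant), which the text does not specify; the function names and the toy metric values for six cells are hypothetical.

```python
from statistics import median

def mad_outlier_flags(values, n_mads=3.0):
    """Flag values lying more than n_mads MADs from the median (two-sided)."""
    med = median(values)
    # Unscaled median absolute deviation (assumption: no 1.4826 scaling).
    mad = median([abs(v - med) for v in values])
    return [abs(v - med) > n_mads * mad for v in values]

def cells_to_keep(metrics, n_mads=3.0):
    """metrics: dict of metric name -> per-cell values in the same cell order.

    A cell is kept only if it is not an outlier for any metric.
    """
    per_metric = [mad_outlier_flags(v, n_mads) for v in metrics.values()]
    return [not any(flags) for flags in zip(*per_metric)]

# Hypothetical toy values for six cells.
metrics = {
    "nCount_RNA": [1000, 1100, 950, 1050, 1020, 9000],  # last cell: very high UMI count
    "percent.mt": [2.0, 2.5, 3.0, 2.2, 30.0, 2.8],      # fifth cell: high mitochondrial %
}
keep = cells_to_keep(metrics)
```

Because the thresholds are derived from each metric's own distribution, they adapt to differences in sequencing depth between datasets, unlike fixed cutoffs.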
Data normalization was performed using the LogNormalize method, scaling each cell to a total of 10,000 counts followed by log transformation. Cell cycle scores (S.Score and G2M.Score) were calculated using the CellCycleScoring function. Highly variable genes were identified using FindVariableFeatures (method = “vst”, nfeatures = 2000). Prior to dimensional reduction, data were scaled using the ScaleData function, regressing out variation due to mitochondrial percentage (percent.mt), ribosomal percentage (percent.ribo), and cell cycle scores (S.Score and G2M.Score). Principal component analysis was performed using RunPCA (npcs = 50). The first 20 principal components were used for clustering. For multi-sample integration, batch effects were corrected using the Harmony algorithm. Uniform Manifold Approximation and Projection (UMAP) visualization and clustering were performed using the Harmony-corrected embeddings. Cell-type annotation was performed based on canonical marker genes reported in the literature, the CellMarker database, and the results of automated annotation using the SingleR package (version 2.12.0). Final cluster identities were assigned based on concordant results across annotation approaches.
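For readers unfamiliar with the LogNormalize step, the per-cell transformation can be written as log(1 + count / total × 10,000). The Python sketch below illustrates this on a single hypothetical cell; the function name `log_normalize` and the toy counts are assumptions, and the study itself used the Seurat implementation.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Scale one cell's counts to scale_factor total, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:  # guard against empty cells
        return [0.0] * len(counts)
    return [math.log1p(c / total * scale_factor) for c in counts]

# Hypothetical cell with four genes and 10 UMIs in total.
normalized = log_normalize([1, 3, 0, 6])
```

Genes with zero counts remain at zero after the transformation, which preserves the sparsity of the expression matrix.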
2.4. Functional Enrichment Analysis
To explore the biological functions and pathways associated with DEGs, gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were conducted using the clusterProfiler R package (version 4.18.4). Enrichment results were considered statistically significant at p < 0.05 after multiple testing correction. Enriched biological processes, cellular components, molecular functions, and pathways were visualized to facilitate biological interpretation.
2.5. Weighted Gene Co-Expression Network Analysis (WGCNA)
Weighted gene co-expression network analysis (WGCNA), using the WGCNA R package (version 1.73), was performed to identify co-expressed gene modules associated with POAG. Prior to network construction, low-variance genes were filtered to reduce noise. Sample clustering was performed to identify potential outliers, and one outlier sample was removed based on a cut height of 100. The final dataset included 35 samples for WGCNA. The soft-thresholding power was determined using the pickSoftThreshold function with powerVector = 1:20 and RsquaredCut = 0.85. The selected soft-thresholding power was 9. The network was constructed using networkType = “unsigned” and TOMType = “unsigned”. Gene modules were identified using the dynamic tree cut algorithm with deepSplit = 2 and minModuleSize = 30. Module merging was performed using mergeCutHeight = 0.25. Genes were clustered into modules using hierarchical clustering, and module eigengenes were calculated. The correlation between module eigengenes and disease status was assessed to identify disease-associated modules. Lactylation-related gene sets were integrated to prioritize modules potentially linked to metabolic regulation.
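To illustrate what the soft-thresholding power does, the sketch below computes an unsigned adjacency, a_ij = |cor(x_i, x_j)|^β with β = 9, for a few toy expression profiles. It is a simplified Python illustration, not the WGCNA package itself, and it omits the topological overlap and module detection steps; the function names, gene labels, and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def unsigned_adjacency(profiles, power=9):
    """Unsigned soft-threshold adjacency |cor|^power for every ordered gene pair."""
    genes = list(profiles)
    return {
        (g, h): abs(pearson(profiles[g], profiles[h])) ** power
        for g in genes
        for h in genes
        if g != h
    }

# Hypothetical expression profiles across four samples.
profiles = {
    "g1": [1, 2, 3, 4],
    "g3": [4, 3, 2, 1],  # perfectly anti-correlated with g1
    "g4": [1, 3, 2, 4],  # correlation 0.8 with g1
}
adj = unsigned_adjacency(profiles)
```

In an unsigned network, strongly anti-correlated genes (g1 and g3) receive the same high adjacency as strongly correlated ones, while raising the moderate correlation of 0.8 to the ninth power shrinks it to roughly 0.13.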
2.6. Machine Learning for Prediction Model
To identify candidate biomarkers with potential diagnostic value, multiple machine learning algorithms were evaluated using the caret package (version 7.0-1). The input features consisted of normalized gene expression values of candidate genes derived from the intersection of differentially expressed genes (DEGs) and the lactylation-associated WGCNA module. Each sample was represented as a feature vector containing the expression values of these genes. The output of each model was a predicted probability score indicating the likelihood that a sample belonged to the POAG group. Probability scores were converted into binary class labels using a threshold of 0.5 for performance evaluation. Model performance was assessed using accuracy, sensitivity, specificity and the area under the receiver operating characteristic curve (AUC). To mitigate overfitting risks associated with the moderate sample size, we restricted feature dimensionality to biologically relevant candidate genes and applied internal cross-validation during model training. Model generalizability was further assessed using an independent external dataset (GSE9944).
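The evaluation metrics listed above can be made concrete with a short Python sketch. It is not the caret-based pipeline used in the study: the function names, predicted probabilities, and labels are hypothetical, and treating a probability of exactly 0.5 as POAG is an assumption. The AUC is computed as the fraction of POAG/control pairs in which the POAG sample receives the higher score, with ties counted as one half.

```python
def confusion_metrics(probs, labels, threshold=0.5):
    """Accuracy, sensitivity, and specificity after thresholding probabilities.

    labels: 1 = POAG, 0 = control. Assumption: prob >= threshold -> POAG.
    """
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc(probs, labels):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions for three POAG and three control samples.
probs = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
metrics = confusion_metrics(probs, labels)
area = auc(probs, labels)
```

Unlike accuracy, sensitivity, and specificity, the AUC does not depend on the 0.5 threshold, which is why both kinds of metric are usually reported together.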
2.7. Immune Infiltration Analysis
The CIBERSORT algorithm was applied to the GSE27276 bulk transcriptomic dataset to estimate the relative proportions of immune cell types within POAG and control samples. Differences in immune cell infiltration between groups were compared, and Spearman correlation analysis was performed to assess associations between the expression levels of potential diagnostic model genes and immune cell populations. Immune infiltration analysis was conducted independently from the machine learning pipeline and was not involved in feature selection or model construction.
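The Spearman correlation used for the gene-immune cell associations is the Pearson correlation of rank-transformed values. A minimal Python sketch follows; it does not reproduce the CIBERSORT deconvolution, and the function names, expression values, and immune cell fractions are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical gene expression and estimated immune cell fraction in four samples.
expression = [1.2, 2.5, 3.1, 4.8]
fraction = [0.10, 0.05, 0.20, 0.30]
rho = spearman(expression, fraction)
```

Working on ranks makes the statistic robust to the skewed, bounded distributions typical of estimated cell fractions.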
2.8. GSVA and GSEA
Gene set variation analysis (GSVA) was performed using the GSVA package (version 2.4.7) to evaluate pathway activity changes between glaucoma and control samples at the individual sample level. This method generates an enrichment score for each sample based on coordinated expression of the predefined gene set. Specifically, a lactylation-related gene set was defined, and the ssGSEA method within the GSVA package was used to calculate a lactylation score for each sample as a single-sample enrichment score. This score was then incorporated as a continuous trait in WGCNA for module–trait correlation analysis. In parallel, gene set enrichment analysis (GSEA) was conducted to identify significantly enriched biological pathways in association with disease status. Curated gene sets were obtained from the Molecular Signatures Database (MSigDB), and pathways with FDR < 0.25 were considered significant.
2.9. Drug–Gene Interaction Prediction
Potential drug–gene interactions for the identified candidate genes were predicted using the Drug–Gene Interaction Database (DGIdb). Interactions supported by curated databases or experimental evidence were retained, providing preliminary insights into potential therapeutic targets for glaucoma.
2.10. Quality Control and Data Standardization
Quality control and downstream analysis of single-cell RNA sequencing data were performed using the Seurat package (version 5.4.0). Cells with low gene counts, excessive mitochondrial gene expression, or abnormal library sizes were excluded. Data were normalized and then subjected to principal component analysis (PCA). Batch effects across samples were corrected using the Harmony algorithm. The processed single-cell data were subsequently used to examine the cell type-specific expression patterns of candidate genes identified from bulk analyses.
4. Discussion
Collectively, this study establishes an integrative framework that links metabolic and immune-related transcriptional alterations to candidate diagnostic biomarkers in POAG. We combined bulk transcriptomics, single-cell RNA sequencing, immune deconvolution, and machine learning to investigate molecular alterations associated with POAG. By anchoring the analysis on lactylation-associated gene modules and validating findings across multiple analytical layers, we identified S100A2 and S100A14 as candidate diagnostic biomarkers and explored their potential biological context.
To place these markers in their biological context, we first examined global transcriptomic alterations in POAG. This analysis revealed widespread gene expression changes, consistent with previous reports demonstrating extensive molecular remodeling in glaucomatous trabecular meshwork and retinal tissues [9]. Functional enrichment analyses highlighted oxidative stress responses and extracellular matrix organization, both of which have been recognized as central pathological features of glaucoma [4,5].
By incorporating lactylation-associated genes into a weighted gene co-expression network framework, our study extends prior transcriptomic analyses by linking metabolic reprogramming to disease-relevant gene modules. Protein lactylation has emerged as a key mechanism connecting altered cellular metabolism to transcriptional regulation under pathological conditions [14]. Although the present study infers lactylation involvement at the transcriptomic level, the identified module provides a transcriptomic context for exploring metabolic regulation in POAG.
We then used an ensemble machine learning strategy to prioritize core candidate genes with potential diagnostic value from these modules, which we consider a major strength of this work. Ensemble approaches have been shown to outperform individual algorithms in glaucoma classification and biomarker discovery tasks [17]. Using this framework, S100A2 and S100A14 were consistently identified as the most informative predictors.
Members of the S100 protein family are known to participate in calcium signaling, inflammatory responses, and cellular stress pathways [18]. Previous studies have reported altered expression of S100 family proteins in various ocular and neuroinflammatory conditions, supporting the potential relevance of S100A2 and S100A14 in glaucoma-associated molecular processes.
Furthermore, immune infiltration analysis demonstrated altered immune-related compositional patterns in POAG, with model genes showing consistent associations with immune cells and immune regulatory molecules. Increasing evidence supports a role for immune dysregulation and neuroinflammation in glaucomatous optic neuropathy [6,7]. Importantly, these observations should be interpreted as associative rather than causal. The limited statistical significance observed in several correlation analyses may reflect the moderate sample size and the inherent variability associated with bulk transcriptomic-based immune deconvolution methods.
Moreover, given the non-hematopoietic nature of trabecular meshwork tissue, immune deconvolution-based estimates reflect inferred immune-related transcriptional signatures rather than direct measurements of immune cell infiltration and should therefore be interpreted with appropriate caution. Nevertheless, the consistency between immune-associated patterns and pathway enrichment results supports the relevance of immune-related processes in POAG, as previously reported in experimental and clinical studies [19].
To examine these findings at the cellular level, we incorporated single-cell RNA sequencing data, which represents another key strength of this study. Single-cell analyses showed that S100A2 and S100A14 exhibited sparse but non-uniform expression across cell populations, underscoring the importance of cellular context in interpreting POAG-related transcriptomic findings. Prior single-cell studies have demonstrated substantial cellular heterogeneity in ocular tissues that cannot be resolved using bulk transcriptomic approaches alone [11].
By validating bulk transcriptomic findings at single-cell resolution, our study mitigates the confounding effect of cell composition changes and supports the biological relevance of the identified biomarkers.
As a preliminary exploratory step, the identification of S100A2 and S100A14 as candidate diagnostic biomarkers may inform future assay development, pending clinical validation for early disease detection or patient stratification. Moreover, drug–gene interaction analysis provides a preliminary framework for therapeutic exploration, aligning with recent efforts to repurpose existing drugs for glaucoma treatment [20].
Although multiple glaucoma-associated biomarkers have been reported previously, routine POAG diagnosis still relies primarily on established clinical and imaging parameters. This is largely because most molecular biomarkers remain at the candidate stage and have not yet undergone sufficient prospective validation, assay standardization, or demonstration of incremental clinical value beyond existing diagnostic approaches.
The feasibility of the analytical framework is supported by the use of publicly available datasets and well-established computational tools, suggesting that this strategy can be readily applied to other ocular diseases.
Despite the promise of this analytical framework, several limitations of the study deserve attention. First, the study relies primarily on retrospective public datasets with a moderate sample size. Although an independent external cohort was used for validation, the machine learning analysis should be interpreted as an exploratory analysis for candidate biomarker prioritization rather than as a formal clinical validation. Larger prospective clinical cohorts are required to further assess generalizability and real-world utility. Second, lactylation-related associations were inferred computationally and were not experimentally validated at the protein modification level. Third, while single-cell analyses provided valuable validation, current datasets remain limited in sample size and tissue diversity, which may restrict resolution of disease heterogeneity.
Future studies incorporating experimental validation of lactylation events, functional assays in relevant ocular cell types, and longitudinal clinical cohorts will be essential to further elucidate the mechanistic and clinical relevance of S100A2 and S100A14 in POAG.