Uncovering the Molecular Drivers of NHEJ DNA Repair-Implicated Missense Variants and Their Functional Consequences

Variants in non-homologous end joining (NHEJ) DNA repair genes are associated with various human syndromes, including microcephaly, growth delay, Fanconi anemia, and different hereditary cancers. However, very little has been done previously to systematically record the underlying molecular consequences of NHEJ variants and their link to phenotypic outcomes. In this study, a list of over 2983 missense variants of the principal components of the NHEJ system, including DNA Ligase IV, DNA-PKcs, Ku70/80 and XRCC4, reported in the clinical literature, was initially collected. The molecular consequences of variants were evaluated using in silico biophysical tools to quantitatively assess their impact on protein folding, dynamics, stability, and interactions. Cancer-causing and population variants within these NHEJ factors were statistically analyzed to identify molecular drivers. A comprehensive catalog of NHEJ variants from genes known to be mutated in cancer was curated, providing a resource for better understanding their role and molecular mechanisms in diseases. The variant analysis highlighted different molecular drivers among the distinct proteins, where cancer-driving variants in anchor proteins, such as Ku70/80, were more likely to affect key protein–protein interactions, whilst those in the enzymatic components, such as DNA-PKcs, were likely to be found in intolerant regions undergoing purifying selection. We believe that the information acquired in our database will be a powerful resource to better understand the role of non-homologous end-joining DNA repair in genetic disorders, and will serve as a source to inspire other investigations to understand the disease further, vital for the development of improved therapeutic strategies.


Introduction
Maintaining the integrity of the genome is crucial for any organism's survival [1].Double-strand breaks (DSBs) are deemed as one of the most harmful forms of DNA damage since, if left unrepaired, they can result in cell death, or chromosomal rearrangements if inappropriately repaired, leading to cancer [2].Nonhomologous DNA end joining (NHEJ) is one of the main DSB repair pathways used to repair DNA DSBs in mammalian cells and occurs throughout the cell cycle [3].
The primary participating factors in NHEJ DNA repair machinery include Ku70/Ku80 heterodimers, DNA-PKcs, XLF, XRCC4, and DNA Ligase IV.The Ku70/Ku80 heterodimer binding to the broken DNA initiates the NHEJ repair machinery.Hence, this recruits Genes 2023, 14, 1890 2 of 11 DNA-PK, whose autophosphorylation is vital for NHEJ.After DSB end-processing, Ligase IV interacts with XRCC4 and XLF to form an NHEJ-specific Ligase [4].
Variants have been defined in multiple components of the NHEJ DNA repair pathway, including PRKDC (encoding DNA-PKcs), XRCC4, XRCC5 (encoding Ku80), XRCC6 (encoding Ku70) and LIG4 (encoding DNA Ligase IV) [5].These variants have been associated with various human syndromes, including microcephaly [6], severe combined immunodeficiency (SCID) [7], growth delay, Fanconi anemia [8], and different hereditary cancers [9,10].Additionally, it has been demonstrated by a significant amount of genetic evidence that the loss or variation of the core NHEJ players leads to increased genomic instability and the development of cancer [11] Numerous studies have sought to identify genetic single nucleotide polymorphisms associated with carcinogenesis in the core NHEJ factors [12][13][14][15].Notably, many of these studies had small patient sample sizes and needed to be subsequently verified.Indeed, many of the NHEJ proteins are well described; however, information about the molecular consequences of missense variants in NHEJ's main components has yet to be fully characterized by a single source.Thus, this information is scattered throughout the literature.
Previously, we have demonstrated that computational approaches can be applied for a more profound understanding of the effects of missense variants on the 3D structure of the protein to elucidate the molecular mechanisms underlying the disease and improve the prediction of the disease prognosis [16][17][18].
We prioritized four NHEJ core factors, Ku70/80, Ligase IV, DNA-PKcs, and XRCC4, for our computational analysis due to the relatively high concentration of cancer-causing missense variants distributed in these factors.To that end, we characterized and analyzed cancer-causing missense variants' structural and functional consequences and compared them statistically to those caused by nonpathogenic (population) variants.In addition to providing the most exhaustive list of missense variants for NHEJ core components, this study incorporates a methodology for exploring and analyzing these variants to better understand vital mechanisms of genetic disorders.

Data Collection
As a starting point, the disease-causing (clinical) missense variants were collected from the COSMIC [19] database that incorporates somatic variants in human cancer.These variants were first curated in 2019 and updated in 2022.
The core NHEJ factors XRCC4 (n = 67) Ligase IV (n = 259), Ku70/80 heterodimer (n = 346) and DNA-PKcs (n = 654) were favored owing to their strong associations with disease and their respective enrichments in missense variants.Additionally, we collected a set of nonpathogenic variants based on population variation acquired using gnomAD [20] V.2.1.1;these variants were annotated using the Ensembl Variant Effect Predictor (VEP) [21] V.95.At this stage, we removed those variants that showed inconsistent variantal consequences across both the COSMIC and gnomAD databases to reduce the potential for misunderstanding, and any remaining population variants in XRCC4 (n = 83) DNA Ligase IV(n = 444), Ku70/80 heterodimer (n = 380) and DNA-PKcs (n = 1483) were regarded as nonpathogenic.

NHEJ Structural Curation
It was possible to obtain the experimental crystal structure of the Ku 70/80 heterodimer (PDB ID: 6ERG [22]) bound to DNA and XLF generated at a resolution of 2.90 Å. Structural pre-processing and minimization was performed using Maestro V.11.4 to fill in missing atoms and residues and to remove atomic clashes.This structure was used to calculate Ku70/80 variant features, while the AlphaFold2 structure of Ku70/80 was used to obtain the predicted local distance difference test (pLDDT) scores.DNA-PKcs' crystal structure (PDB ID: 5Y3R [23]) was available bound to Ku70/80, and it was used for calculating the variant features for DNA-PKcs.Similar to Ku70/80, the structure obtained using AlphaFold2 of DNA-PKcs was used to calculate pLDDT scores.As RCSB PDB lacks full experimental structures for LIG4 and XRCC4, AlphaFold2 [24] was used to generate full structures for these proteins.One experimental structure of LIG4 bound to DNA was found (PDB: 6BKG, residues 1-620) and used for calculating the variantal features, specifically (changes in nucleic acid (DNA) affinity).
We conducted in silico biophysical measurements based on mCSM-Stability [32], Dyna-Mut2 [33], DynaMut [34], SDM [35], and DUET [36] to predict variants' changes in stability and dynamics.Also, we calculated variantal effects on protein-protein interactions via mCSM-PPI [37] and mCSM-PPI2 [38], along with distances to the interface.The calculations of protein interactions included gene-dependent Ku heterodimer, DNA-PKcs and LIG4 bound to DNA.The associated impacts of these bindings on affinity were calculated for the experimental structures 6BKG and 5Y3R with mCSM-NA [39].For DNA-PKcs, the distance to ATP was measured.Using Arpeggio [40], we assessed the effects of variant on local molecular interactions.

Qualitative and Statistical Analysis
We compared the consequences of pathogenic and nonpathogenic variants on the calculated features using Welch's two-tailed t-test to determine potential molecular drivers in the NHEJ repair machinery.To evaluate features as potential molecular drivers, we looked for statistically significant differences between the two classes (p-value < 0.05).
A comparison of individual variants in terms of heterodimer affinity (mCSM-PPI and mCSM-PPI2), protein stability (DynaMut2), and vibrational entropy (ENCoM value, obtained using DynaMut) was performed, as previously described, based on their 0structural localization.Variants of KU70/80 were assessed based on heterodimer affinity changes rather than stability changes since even minor changes at the heterodimer interface can significantly contribute to pathogenicity.It is noteworthy that only variants located within a 10 Å of the protein-protein interface of the Ku70/80 heterodimer were examined, since heterodimer affinity has been regarded to subside over distance.

Model Training
Our final analysis used the ensemble algorithm ExtraTrees (with 100 trees) within Sci-kit Learn V.0.20.3.[41] to test the predictability of important features for phenotyping variants.A comparison of the performance of all features and subsets of important features in phenotyping was conducted, with important features highlighted from each model.

Data Curation and Variant Distribution of NHEJ Principal Components
The final curated database was acquired from COSMIC and gnomAD and incorporates a total of 1326 pathogenic and 2390 nonpathogenic missense variants in NHEJ main factors, spread across five genes, summarised in online Supplemental Table S1.Although missense variants in the main components of NHEJ repair machinery are not the only cause of the disease (cancer), computational approaches, such as those accounting for protein structural consequences, can effectively analyse these types of genetic variation.
Accordingly, we identified potential molecular drivers of disease by applying our computational analysis pipeline to the most mutated genes in NHEJ DNA repair machinery: LIG4, Ku70/80 (XRCC5/6), DNA-PKcs (PRKDC), and XRCC4.An overview of the phenotypes collected for each of the NHEJ core components is described in Table 1.Next, we visualized the distributions of the missense variants within the structures of NHEJ principal components (Supplemental Figure S1), which illustrated that cancercausing (pathogenic) variants were widely distributed across each gene of the NHEJ core components and their subsequent 3D structures without a specific localization.Similar patterns were observed for the population variants (nonpathogenic) within each of the principal NHEJ players.

Identifying Molecular Drivers in Ku70/80 Heterodimer
A comparison of the molecular effects of Ku70/80 pathogenic (n = 346) with nonpathogenic (n = 380) variants (Figure 1A) revealed that pathogenic variants were more likely to be found close to the protein-protein interface, leading to a disruption of the interaction between the KU70/80 heterodimer (Distance_Ku70_80 p-value: 0.043).In addition, as estimated by measures of functional deleteriousness (SNAP2 p-value: 0.022, PROVEAN p-value: 0.017, SIFT p-value: 0.003), pathogenic variants tend to occur at functionally essential protein regions.
Based on these effects, we developed a predictor that could correctly identify 91% of pathogenic variants and 96% of nonpathogenic variants.As a result of our predictor predictions, Distance_Ku70_80 has been deemed the most significant pathogenicity driver (contributed the most by 4%, Figure 1B).According to these observations, tumorigenesis is primarily associated with a Ku70/80 function disruption, where pathogenic variants are localized within the protein-protein interface.
As a final analysis, to determine the main drivers of pathogenicity in Ku70/80, we analyzed each pathogenic variant structurally (Figure 1C).We found that 54% of these variants decrease stability, and 60% reduce protein-protein interactions within the Ku heterodimer.The findings indicate that, in addition to reducing stability, Ku-mediated tumorigenesis is caused by a decrease in protein-protein interactions and conformational changes within the heterodimer.

Identifying Molecular Drivers of Pathogenicity in DNA-PKcs
An analysis of DNA-PKcs pathogenic variants (n = 654) compared to nonpathogenic ones (n = 1483) showed that pathogenic variants were more likely to be found in functionally important and intolerant regions undergoing purifying selection based on conservation (PAM30 p-value: 3 × 10 −5 ) and function (MTR score p-value: 0.010) calculations for proteins (Supplemental Figure S2A).Based on statistically significant features (A), interaction profiles associated with variants play an essential role in Ku-mediated tumorigenesis.Supervised machine learning (B) confirmed the high predictive potential of the changes in the proteinprotein interface of the Ku heterodimer.Through mapping variants on the heterodimer structure (C), we were able to verify that stability plays a critical role in disease (red), as do conformational changes (light blue).

Identifying Molecular Drivers of Pathogenicity in DNA-PKcs
An analysis of DNA-PKcs pathogenic variants (n = 654) compared to nonpathogenic ones (n = 1483) showed that pathogenic variants were more likely to be found in functionally important and intolerant regions undergoing purifying selection based on conservation (PAM30 p-value: 3 × 10 −5 ) and function (MTR score p-value: 0.010) calculations for proteins (Supplemental Figure S2A).
As a result of this localization, pathogenic variants were likewise highly likely to be solvent-accessible (RSA p-value: 0.004) and, hence, reactive towards binding partners.Regarding ligand binding (ATP), pathogenic variants were particularly clustered near the ATP binding sites (p-value: 0.004).Closeness to ATP binding implies that ATP-mediated changes in catalytic DNA-PKcs activity likely drive pathogenicity.
A machine learning-based predictor was trained using all significant features, which correctly classified 97% of pathogenic variants and 94% of nonpathogenic variants.Based on the various contributors to these predictions (Supplemental Figure S2B), it was found that changes in ATP-binding affinity (distance to ATP) contributed the most (16%).In contrast, the MTR score contributed substantially (8%).We also investigated changes in the DNA affinity of the DNA-bound structure (PDB 5Y3R, residues 1503-1538) caused by a subset of three pathogenic and 10 nonpathogenic variants (Supplemental Table S2).No notable differences between phenotypic classes were observed besides the significant enrichment of nonpathogenic variants.It is suggested, however, that DNA-mediated effects are not important drivers of tumorigenesis as the DNA-binding region within DNA-PKcs is enriched in nonpathogenic variants.
The phenotypes of all pathogenic and nonpathogenic variants in our dataset could be predicted using a machine-learning analysis combining these influential molecular descriptors.For the predictions, the developed classifier used functional scores represented by MTR-3D (4%), Envision (3.2%), SNAP2 (2.7%), and ∆∆G-sdm (2.6%, Figure 2B), further highlighting their involvement in pathogenicity.
Using changes in stability and vibrational entropy to analyze pathogenic variant in LIG4 (Figure 2C), we found that most variants were associated with increased flexibility (23%) or protein destabilization (30%), further establishing the role of protein conformational changes in tumorigenesis and pathogenicity of LIG4.

Uncovering Molecular Drivers in XRCC4
Although comparable drivers of pathogenicity were observed in XRCC4 (Figure 3A), as expressed by protein conservation (ConSurf p-value: 0.016), distinctive mechanisms for variant localization were seen.The distribution of phi angles (phi p-value: 0.007) for pathogenic variants (n = 67) was more distinct than that for nonpathogenic ones (n = 83); they tend to cluster at the core of the protein (residue depth p-value: 1.6 × 10 −5 ).In addition, variantal changes in stability highlighted that pathogenic variants in XRCC4 are highly destabilizing (∆∆G_dynamut p-value: 0.042).
When we trained a machine learning-based predictor by combining all of these significant features, 95% of pathogenic and 81% of nonpathogenic variants were correctly classified.The local residue environment represented by residue depth contributed the most towards the predictor predictions, followed by phi angle (17.5%),ConSurf (9%) and ddg_dynamut (9%) (Figure 3B).It is evident from the highlighted results that variant localization explains how different protein conformational states result in functional changes that are essential for elucidating disease.
When pathogenic variant consequences on the protein structure were examined in terms of protein stability and vibrational entropy (Figure 3C), it was found that, in XRCC4, pathogenic variants mainly cause clinical phenotypes by destabilizing the protein (35.8%), as well as causing a rise in protein flexibility (43%), indicating that protein conformational changes contribute to disease development as precursor mechanisms to carcinogenesis.

Uncovering Molecular Drivers in XRCC4
Although comparable drivers of pathogenicity were observed in XRCC4 (Figure 3A), as expressed by protein conservation (ConSurf p-value: 0.016), distinctive mechanisms for variant localization were seen.The distribution of phi angles (phi p-value: 0.007) for pathogenic variants (n = 67) was more distinct than that for nonpathogenic ones (n = 83); they changes that are essential for elucidating disease.
When pathogenic variant consequences on the protein structure were examined in terms of protein stability and vibrational entropy (Figure 3C), it was found that, in XRCC4, pathogenic variants mainly cause clinical phenotypes by destabilizing the protein (35.8%), as well as causing a rise in protein flexibility (43%), indicating that protein conformational changes contribute to disease development as precursor mechanisms to carcinogenesis.Welch's sample t-tests identified statistically significant features (A) that suggested protein conformation plays an essential role in XRCC4-mediated tumorigenesis, which may expose key residues close to the protein surface, as indicated by supervised machine learning (B).Residue depth had the most considerable predictive power.As we mapped variants in 3D (C), we were also able to establish the role of stability and conformational changes in disease (displayed in red and light blue).

Discussion
Our work comprehensively analyzes the consequences of missense variants driving carcinogenesis in the core players of the NHEJ DNA repair machinery, Ku70/80, XRCC4, Ligase IV and DNA-PKcs.Despite being involved in distinctive physiological functions, each of these genes shows standard molecular drivers of disease.Mainly, according to our predictors, all pathogenic variants were functionally deleterious across three core components of NHEJ, Ku70/80, DNA-PKcs and LIG4, while those in XRCC4 localized in areas with a low evolution rate (ConSurf).However, when viewing pathogenic and nonpathogenic variants within protein structures, all variants were widely distributed across the whole structure, with no particular domain localization.
Other distinctions between the five genes were associated with their unique biological functions.Specifically, pathogenic variants of LIG4 and XRCC4 reduced protein stability, consistent with our previous findings [5].
In addition, Ku70/80 interaction measurements suggest that protein-protein affinity change is crucial for Ku70/80 function, as pathogenic variants cluster near the protein interface and reduce the heterodimer binding.On a larger variant dataset, this study demonstrated the importance of protein-protein interaction affinity in Ku70/80-mediated carcinogenesis, which has been briefly investigated and implicated in destabilizing Ku's carboxy-terminal arm region that plays an essential role in heterodimerization [42].
Interestingly, when supervised machine learning was used to fit the data, these affinity changes had the highest phenotypic prediction potential.This observation suggests that these changes may play a role in tumorigenesis and pathogenesis.
Nevertheless, we could not observe conclusive impacts on nucleic acid affinity caused by pathogenic variants when examining the interaction profile of Ku70/80, DNA-PKcs and DNA Ligase IV variants to nucleic acids due to a need for more sufficient data.Despite extensive data curation, pathogenic variants were not detected within nucleic acid binding regions, suggesting that these interactions are crucial for transcription.
Regarding the four protein structures, the molecular drivers identified can be interpreted as a cause of cancer progression and genomic instability.Our work emphasizes how protein-protein affinity change plays a crucial role in Ku70/80-mediated disease, in which protein-protein affinity change presents the best predictability of classifying variants using machine learning.Among the robust predictive features of DNA-PKcs was the distance to ATP and MTR, since pathogenic variants tend to be found closer to ATP binding and intolerant regions undergoing purifying selection.Protein function consequences best indicated the LIG4 variant phenotype.Lastly, when viewing XRCC4, variant phenotype was best predicted by the residue depth, as pathogenic variants were found to occur at the protein core.A closer look at molecular changes revealed several disease mechanisms associated with carcinogenesis.
Over 2983 missense variants are listed in our final database (online Supplemental Table S1), making it the most comprehensive list of NHEJ missense variants available.Cancer patients have also been diagnosed with non-missense variants, such as indels; however, in our work, we focused on missense variants, since structure-based techniques can be used to analyze them with a high throughput, so the three-dimensional consequences of a variant can be adequately considered.
Our database represents the present landscape of missense variants in the NHEJ repair machinery.As missense variants are readily detected in the clinic, cross-referencing variants with our resource can help in the early detection of cancer risk, allowing for the development of therapeutic strategies to slow the disease's progression.In silico simulations of variantal change can be used to gain insight into disease mechanisms across various genes, as demonstrated in this work using LIG4, Ku70/80 (XRCC5/6), DNA-PKcs (PRKDC), and XRCC4.We have gained insight into disease development across the four proteins by combining structure-based estimators.Furthermore, since cancer is a complex disease with multiple aspects, the structural insights regarded in this study and the implications that may follow may be used to identify effective anticancer treatments.

Figure 1 .
Figure 1.Main drivers of Ku70/80 pathogenicity.Based on statistically significant features (A), interaction profiles associated with variants play an essential role in Ku-mediated tumorigenesis.Supervised machine learning (B) confirmed the high predictive potential of the changes in the proteinprotein interface of the Ku heterodimer.Through mapping variants on the heterodimer structure (C), we were able to verify that stability plays a critical role in disease (red), as do conformational changes (light blue).

Figure 1 .
Figure 1.Main drivers of Ku70/80 pathogenicity.Based on statistically significant features (A), interaction profiles associated with variants play an essential role in Ku-mediated tumorigenesis.Supervised machine learning (B) confirmed the high predictive potential of the changes in the proteinprotein interface of the Ku heterodimer.Through mapping variants on the heterodimer structure (C), we were able to verify that stability plays a critical role in disease (red), as do conformational changes (light blue).

Genes 2023 , 11 Figure 2 .
Figure 2. Main drivers of LIG4 pathogenicity.The statistically significant features (A) determined by Welch's sample t-test revealed that LIG4-mediated carcinogenesis is driven by both stability and MTR3D, which signifies functional deleteriousness, as verified via supervised machine learning (B), which had the most considerable predictive power.Most pathogenic variants (C) result in destabilizing (red) or increasing flexibility (light blue), providing further evidence of conformational effects.

Figure 2 .
Figure 2. Main drivers of LIG4 pathogenicity.The statistically significant features (A) determined by Welch's sample t-test revealed that LIG4-mediated carcinogenesis is driven by both stability and MTR3D, which signifies functional deleteriousness, as verified via supervised machine learning (B), which had the most considerable predictive power.Most pathogenic variants (C) result in destabilizing (red) or increasing flexibility (light blue), providing further evidence of conformational effects.

Figure 3 .
Figure 3. Main drivers of XRCC4 pathogenicity.Welch's sample t-tests identified statistically significant features (A) that suggested protein conformation plays an essential role in XRCC4-mediated

Figure 3 .
Figure 3. Main drivers of XRCC4 pathogenicity.Welch's sample t-tests identified statistically significant features (A) that suggested protein conformation plays an essential role in XRCC4-mediated tumorigenesis, which may expose key residues close to the protein surface, as indicated by supervised machine learning (B).Residue depth had the most considerable predictive power.As we mapped variants in 3D (C), we were also able to establish the role of stability and conformational changes in disease (displayed in red and light blue).