1. Introduction
The Diagnostic and Statistical Manual of Mental Disorders, version 5, defines intellectual disability (ID) as neurodevelopmental disorders that begin in childhood and are characterized by significant limitation in conceptual, social, practical skills [
1]. Similarly, the International Classification of Diseases 11th revision (ICD-11), also describes disorders of intellectual development (DID) as a condition arising during the developmental period, marked by substantially below-average intellectual functioning and adaptive behavior. These deficits are typically two or more standard deviations below the mean on standardized, individually administered tests, or determined by clinical assessment when formal testing is not feasible. Diagnosis requires evidence of deficits in intellectual functioning (reasoning, problem-solving, planning, abstract thinking, and learning), adaptive functioning in conceptual, social, and practical domains, and onset during the developmental period [
2]. ICD 11 characterizes autism spectrum disorder (ASD) as a persistent deficit in reciprocal social communication and interaction, accompanied by restricted, repetitive patterns of behavior, interests, or activities that are atypical or excessive relative to the individual’s age and sociocultural context. Symptom onset occurs during the developmental period, typically in early childhood, although difficulties may not become fully apparent until later. These symptoms result in clinically significant impairment in personal, familial, social, educational, or occupational functioning.
ID affects approximately 1% of the global population and imposes a substantial impact on society [
2]. Its etiology is highly heterogeneous, encompassing both genetic and non-genetic factors. Genetic causes account for 30–50% of cases [
3] while non-genetic contributors of ID can be grouped into prenatal, perinatal, and postnatal factors [
4]. Prenatal causes include maternal infections (e.g., cytomegalovirus, rubella, and Zika virus), teratogenic exposure (e.g., alcohol, antiepileptic drugs, and heavy metals) and maternal comorbidities such as phenylketonuria, pregnancy hypertension, asthma, urinary tract infection, pre-pregnancy obesity, uncontrolled diabetes, malnutrition, and obstetrical complications causing anoxia (placenta previa, placenta abruption, and umbilical cord prolapse) [
5,
6,
7]. Perinatal complication such as hypoxic–ischemic brain injury, extreme prematurity, or low birth weight are also major contributors [
8], while postnatal causes include infection (encephalitis, meningitis), head trauma, asphyxia, intracranial tumor, malnutrition and toxin exposure [
9]. Although advances in genomic technologies have improved the detection of monogenic and chromosomal etiologies, prevention of non-genetic causes through maternal health organization, infection control, and perinatal care [
10].
The genetic landscape of ID is remarkably complex, involving diverse molecular mechanisms. Current diagnostic gene panels include more than 1500 genes and up to 10% of the human genome are thought to influence ID pathogenesis [
11,
12]. Genotype–phenotype correlations have been established within specific syndromes and across variant types within the same gene. Recent analyses highlight broader molecular clusters with shared cognitive and behavioral features, underscoring ID as a highly heterogeneous group of disorders. Identification of genetic causes enhances understanding of ID pathophysiology while informing clinical prognosis, recurrence risk, and access to condition-specific support programs [
13].
Despite many ongoing studies, effective therapies for cognitive impairment remain limited except for a few metabolic disorders. Therefore, understanding the underlying genetic, molecular and cellular mechanisms is essential to identify novel intervention approaches.
In recent years, the advances of chromosomal microarray analysis (CMA) and next-generation sequencing (NGS) technologies have revolutionized ID diagnosis at the molecular level. CMA, recommended as a first-tier clinical test, can detect chromosomal copy number variants (CNVs) and achieve diagnostic yields at approximately 20% for neurodevelopmental disorder etiology [
14,
15]. For the remaining cases, NGS approach, including whole-exome sequencing (WES) and clinical exome sequencing (CES), is used and collectively explains approximately 35–50% of neurodevelopmental disorder cases [
3,
16,
17].
Beyond sequencing, multidimensional analysis has recently emerged as a complementary strategy to interpret the complex clinical spectrum of intellectual disability. This approach applies semi-quantitative scoring across multiple clinical domains (e.g., onset, motor, neurological, seizure, MRI, vision, dysmorphism, and systemic involvement), followed by Z-score normalization [
18] and hierarchical clustering analysis (HCA). By converting qualitative clinical observations into standardized quantitative matrices, multidimensional analysis enables systematic mapping of genotype–phenotype correlation and identification of phenotypic clusters reflecting shared molecular pathway. Such data-driven integration improves the power of genomic findings, facilitates patient stratification for functional studies and supports potential precision medicine applications [
19,
20,
21,
22,
23,
24,
25,
26].
Leveraging the strengths of NGS and phenotypic clustering, we performed CES and WES on 90 patients with ID and their relatives to elucidate the genetic etiology of ID. This study aimed to evaluate the diagnostic utility of NGS and to explore genotype–phenotype correlations through multidimensional analyses, thereby enhancing our understanding of the molecular and clinical complexity of ID.
2. Materials and Methods
This study adopted a stepwise diagnostic strategy to maximize the detection of underlying genetic etiologies in patients presenting with ID and/or ASD. The workflow integrated cytogenetic, sequencing, and confirmatory approaches to ensure comprehensive genomic characterization and accurate clinical interpretation (
Figure 1).
All enrolled patients who met the inclusion and exclusion criteria underwent a three-tier genetic testing protocol.
2.1. Study Population
The inclusion criteria for patient enrolment were as follows: (1) diagnosis of ID/ASD by a pediatric neurologist or psychiatrist (according to ICD-11 criteria); (2) clinical evaluation and genetic counseling performed by a clinical geneticist and (3) Absence of chromsomal abnormalities associated with ID/ASD and FMR1 full mutation; (4) received either Exome sequencing (ES) or CES, with ES performed in CES-negative cases to enable broader variant detection.
The exclusion criteria were as follows: (1) patients diagnosed with metabolic diseases based on clinical manifestations and laboratory evaluations; (2) patients whose ID/ASD resulted from perinatal ischemia, hypoxic brain injury, bilirubin encephalopathy, central nervous system infections, poisoning, extreme preterm birth (before 28 weeks of gestation), or trauma; and (3) patients or legal guardians who did not provide consent to participate in the study.
Between January 2023 and January 2025, seventy-five patients from 69 families, diagnosed with ID/ASD of indeterminate etiology, were referred to the Clinical Genetic Center at Hanoi Medical University Hospital by their families for genetic consultation prior to conceiving their next child.
This research is part of the state-level project “Screening and Etiology Diagnosis of Genetic Mental Retardation in Vietnamese People” (grant number ĐTĐL.CN-80/22, approved under Decision 2349/QĐ-BKHCN, Hanoi, 2 February 2022) and was conducted with ethical approval (Ethical Certification No. 93/GCN-HĐĐĐNCYYSSH-ĐHYHN, Hanoi, 9 June 2023).
2.2. Sample Collection
After obtaining written informed consent from the guardians, 2 mL of peripheral venous blood was collected from each participating child for WES, CES, and CNV detection. In families with multiple affected children exhibiting similar phenotypes, WES/CES was initially performed on a single child. Blood samples were also collected from parents and other siblings for further investigation when a related variant—pathogenic, likely pathogenic, or of uncertain clinical significance (VUS)—was identified in the affected child. Sanger sequencing was employed to confirm monogenic variants and determine their origin (de novo in dominant disorders) or allelic phase (in cis or trans in recessive disorders). Similarly, CNV analysis and karyotyping were performed in parents and relatives to establish the origin of CNV variants.
2.3. Whole Exome Sequencing and Clinical Exome Sequencing
Genomic DNA was enzymatically fragmented, and target regions were enriched using DNA capture probes. DNA libraries were prepared with a New England Biolabs kit (Ipswich, MA, USA) and enriched using targeted xGen probes from Integrated DNA Technologies (IDT, Coralville, IA, USA). The enriched libraries were sequenced either for the whole exome (WES) or for 4503 targeted genes (CES) using the Illumina NextSeq System (San Diego, CA, USA), achieving a minimum mean coverage of 100× and at least 95% of target regions covered at >10×. Sequenced reads were aligned to the human reference genome GRCh38 to identify variants.
2.4. Copy Number Variant Sequencing (CNVseq)
CNVseq is a NGS–based approach (using WES and CES) designed to identify chromosomal aneuploidies and DNA abnormalities ≥400,000 nucleotides. Quality control of sequencing data was performed with FastQC (version 0.11.9). DNA samples were sequenced on a NextSeq 550 platform using a paired-end 2 × 75 bp Reagent Kit (Illumina, San Diego, CA, USA). Raw sequences were aligned to the human reference genome (GRCh38) using BWA (v0.7.17) and SAMtool(v1.17), and duplicate reads were removed with Picard (v2.27.5) (Broad Institute, Cambridge, MA, USA). The processed data were annotated using the OMIM and DECIPHER databases for CNV analysis.
Variants with a frequency <1% in the Vietnamese genetic database (including the 1000 Exome Sequencing Project) were further analyzed. Variants with a minor allele frequency <1% in gnomAD, as well as deleterious variants reported in the Human Gene Mutation Database (HGMD), ClinVar, or the internal biodata bank, were assessed. Significant variants were localized within coding exons and ±10 nucleotides of splice junctions, supported by definitive genotyping evidence per OMIM® data.
All variants were classified according to ACMG/AMP criteria and ClinGen recommendations into five categories: pathogenic, likely pathogenic, VUS, likely benign, and benign. Variants relevant to the patient phenotype were documented. Variants with low sequence quality or ambiguous genotypes were confirmed by Sanger sequencing, resulting in a specificity exceeding 99.9% for all documented variants. Certain genetic changes—including gene rearrangements (fusions, inversions), short tandem repeat variations, pseudogenes, mosaic variants, and mitochondrial gene variants—may not be detected by this method.
2.5. Sanger Sequencing
Gene sequences and mutation locations of interest were obtained from the NCBI Genome database (GRCh38). Primers specific to the target genes were designed using Primer3Plus, evaluated for specificity using NCBI Primer-BLAST (NCBI, Bethesda, MD, USA) and the UCSC In-Silico PCR tool (Santa Cruz, CA, USA), synthesized by IDT, and employed for PCR amplification. PCR was performed on the Applied Biosystems ProFlex™ (Thermo Fisher Scientific, Waltham, MA, USA) 3 × 32-well PCR System using the relevant DNA sequences of the variants as templates. Purified PCR products were subjected to Sanger sequencing on an Applied Biosystems 3130xl Genetic Analyzer with POP-7™ polymer. Sequencing results were analyzed using CodonCode Aligner software (v9.0.1), with alignment and variant verification performed against the UCSC Genome Browser (GRCh38) and NCBI BLAST to determine mutation site and type.
2.6. Clinical Phenotyping and Semi-Quantitative Scoring System
Quantitative assessment of clinical severity in genetic and neurodevelopmental disorders has increasingly adopted structured, multi-domain scoring systems to capture phenotypic variability in a standardized and reproducible manner. Previous studies—such as those by Klouwer et al. (2018) in Zellweger spectrum disorders, Aglan et al. (2012) in osteogenesis imperfecta, and the Cornelia de Lange syndrome clinical severity scale—have demonstrated that ordinal scoring across multiple clinical domains provides an objective framework for comparing disease burden and functional outcomes [
27,
28].
Similarly, Nagy et al. (2022) [
29] applied a four-tiered severity scale for POGZ-related neurodevelopmental disorders to classify patients according to overall phenotypic impact. These approaches share key methodological features, including multi-systemic coverage, ordinal grading reflecting progression from mild to severe manifestations, and incorporation of both neurological and systemic parameters—all of which informed the design of the present scoring model [
29].
Accordingly, our study employed a 15-domain ordinal scoring system, with each domain rated from 1 (mild or late-onset) to 4 (severe, early-onset, or multisystemic), and an additional “NA/Normal” category for absent or unassessed features. This framework enables consistent quantification of clinical heterogeneity and facilitates genotype–phenotype correlation analyses across the studied cohort. All scores were derived from medical records, clinical examination reports, EEG/MRI findings, and genetic test results, and independently reviewed by two clinical geneticists to minimize subjectivity. This structured scoring system enabled transformation of complex clinical data into quantitative matrices, which were subsequently used for correlation analyses, Z-score normalization, and hierarchical cluster analysis (HCA) to identify phenotypic subgroups and gene–pathway associations (
Table 1).
2.7. Statistical and Multidimensional Analysis
Descriptive statistics were used to summarize the demographic and clinical characteristics of the cohort. Continuous variables (e.g., age) were reported as mean ± standard deviation (SD), while categorical variables were presented as frequencies and percentages. Comparisons between genetic analysis–positive and –negative patients were performed using independent t-tests for continuous variables and chi-square tests for categorical variables. Statistical significance was defined as p < 0.05. All analyses were conducted using SPSS (version 26.0; IBM Corp., Armonk, NY, USA) and R (version 4.4.1; R Foundation for Statistical Computing, Vienna, Austria).
Beyond conventional univariate statistics, an integrated multidimensional analytical framework was developed, building on methodologies previously established in Vietnamese biomedical data analytics [
19,
20,
21,
22,
23,
24,
25,
26]. These prior studies validated the application of hierarchical clustering, Z-score normalization, and correlation-based multidimensional modeling to investigate complex clinical–genetic interactions.
All quantitative clinical features were standardized using Z-score normalization, defined as
where
is the raw value for variable
j in individual
I,
is the mean, and
is the standard deviation for feature
j. This transformation ensured comparability across heterogeneous clinical scales by rescaling all domains to a mean of 0 and SD of 1. The normalized matrix was subjected to hierarchical cluster analysis (HCA) using Euclidean distance and complete linkage to identify latent groupings of patients sharing similar clinical patterns. Heatmap visualization integrated HCA-derived structure with gene-level and biological pathway annotations (e.g., ion channel regulation, transcriptional control, myelination, and metabolic signaling) to highlight phenotypic convergence and mechanistic differentiation across gene groups.
Pairwise relationships among clinical domains were quantified using Pearson’s correlation coefficients (R) with corresponding p-values, visualized in a color-coded correlation heatmap. This matrix-based representation provided insight into interdependent phenotypic features and potential syndromic clusters within the cohort.
3. Results
3.1. Sex and Age Distribution of Gene Variants
Among 75 patients with ID, 47 were male and 28 were female with ages ranging from 2 to 23 years (mean ± SD: 9.7 ± X years). Of these, 59 underwent WES, 10 underwent CES, and the remaining six affected siblings had their genetic diagnoses confirmed by Sanger sequencing (
Supplementary Tables S1–S3).
Causal variants were identified in 38 of 75 patients, comprising 28 cases with pathogenic or likely pathogenic variants linked to monogenic diseases and 10 cases with pathogenic CNVs. Additionally, seven unconfirmed cases carried either variants of unknown clinical significance (VUSs) or a pathogenic heterozygous variant inherited from an unaffected parent.
A small proportion of cases were diagnosed between ages 3 and 6 years (8.3% for CNVs and 5% for SNVs), whereas most CNV cases (41.7%) were identified in children aged 6–8 years. In contrast, SNV-positive cases were predominantly observed in older children, with 67.5% detected in those over 8 years of age. The mean age for CNV cases was 8.5 ± 3.5 years, compared with 10.4 ± 4.4 years for SNV cases, suggesting that CNVs tend to be diagnosed earlier than SNVs, likely due to the more pronounced clinical features associated with larger genomic alterations [
11].
Among patients analyzed by CES or WES, the overall molecular diagnostic yield (detection of pathogenic or likely pathogenic variants) did not differ significantly between males and females (
Supplementary Figure S1). Although males exhibited a slightly higher positive rate, this difference was not statistically significant (
p > 0.05, Chi-square or Fisher’s exact test). Similarly, age distribution did not differ between patients with positive and negative genetic findings (
Supplementary Figure S2). Median ages were comparable, as confirmed by the Mann–Whitney U test (
p = 0.598). Age density distribution (
Supplementary Figure S3) showed substantial overlap between the two groups, indicating that diagnostic yield was independent of age and that no apparent age-related bias influenced the detection of pathogenic variants.
3.2. Monogenic Variants in Children with ID
Among the 28 cases with identified monogenic variants, 11 de novo variants were detected in nine families, along with 15 novel variants. Parental and sibling samples were subsequently analyzed by Sanger sequencing to confirm segregation patterns. All variants were classified according to the ACMG/AMP/ClinGen Sequence Variant Interpretation (SVI) guidelines.
Our analysis highlighted a strong association between neurological impairment, motor development delays, intellectual disability, and dysmorphic features in genetic disorders. Most genes were linked to multiple phenotypic domains, underscoring the importance of comprehensive diagnostic assessment. Syndromic forms of ID, such as Cornelia de Lange (NIPBL), Pitt-Hopkins (TCF4), and Baraitser-Winter (ACTG1), frequently exhibited distinctive craniofacial features along with severe cognitive and motor deficits. In contrast, isolated neurological conditions, such as phenylketonuria (PKU) caused by PAH gene mutations, illustrate that some disorders can be effectively managed through early detection and timely intervention.
3.3. Visualization and Key Insights
3.3.1. Gene Mutation Frequency and Mutation Type Distribution
NKX6-2 exhibited the highest mutation frequency, with five confirmed cases from two families, highlighting its potential clinical relevance.
SCN2A followed, with three confirmed cases from two families, consistent with its well-established role in epilepsy and neurodevelopmental delay. Mutations in
PRUNE1,
SMAD6, and
NIPBL were deteced at lower frequencies, each in a single case. Additionally, variants in
HEPHL1,
MT-SS2, and
OSCEP were identified only in unconfirmed cases, indicating a need for further validation (
Supplementary Table S1).
Among the 18 genes identified in confirmed cases, nine were linked to autosomal dominant diseases or syndromes, with nearly all pathogenic variants occurring de novo, except for two instances in which parental samples were unavailable. Of the remaining cases, seven genes were associated with autosomal recessive inheritance, and two with X-linked recessive disorders.
Loss-of-function variants were the most common mutation type, with frameshift variants—arising from insertions and deletions—being the predominant subtype. Missense variants were also prevalent across several genes. Overall, three affected genes were associated with autosomal recessive diseases, whereas four corresponded to autosomal dominant conditions (
Figure 2).
3.3.2. Phenotypic Spectra
ID was the predominant clinical feature among the studied patients. Moderate-to-severe ID was observed in individuals carrying variants in TCF4, IQSEC2, ADSL, NGLY1, and PGAP3. Severe cognitive impairment occurred in patients with mutations in NKX6-2, PLP1, PAH gene (when untreated), and SCN2A (in early-onset type), whereas milder ID was associated with SCN2A (in late-onset type) or SMAD6 variants.
Distinct dysmorphic features were noted in several cases. For example, a patient with a NIPBL mutation (Cornelia de Lange syndrome) presented with synophrys, long curly eyelashes, a depressed nasal bridge, and a thin upper lip, along with vesicoureteral reflux. A patient with an ATP1A3 mutation (Snijders Blok-Campeau syndrome) exhibited frontal bossing and a round face. A SMAD6 variant was associated with premature craniosynostosis and hip dislocation, while TCF4 mutations (Pitt-Hopkins syndrome) produced characteristic facial anomalies, including coarse facial features, a protruding lower face, full cheeks, and cupid’s bow upper lip, in two distinct patients. In contrast, patients with IQSEC2, KCNN4, ADSL, or SCN2A variants did not display external dysmorphic traits.
Overall, variants in
SCN2A, IQSEC2, and
TCF4 had pronounced impact on neurological, motor, and cognitive functions across patients. Syndromic conditions, such as Cornelia de Lange, Pitt–Hopkins, and Snijders Blok–Campeau syndromes, were characterized by distinctive facial and skeletal anomalies, whereas other disorders primarily manifested with severe neurological symptoms in the absence of dysmorphic features. Importantly,
PAH deficiency caused phenylketonuria (PKU), a treatable metabolic disorder, whereas management of other genetic conditions focused on symptomatic and supportive care rather than curative interventions (
Table 2,
Supplementary Table S2).
The Venn diagram illustrates the genetic overlap among major clinical phenotypes, including neurological features (seizures, hypotonia, and spasticity), motor development challenges (including loss of ambulation and fine motor impairment), ID, and dysmorphism (distinct facial or skeletal anomalies) (
Figure 3,
Table 3). This overlap suggests that genes influencing neurological function often contribute to both motor and cognitive dysfunction, which is consistent with the shared neurological pathways regulating both movement and cognition. Examples include
SCN2A,
IQSEC2, and
ADSL, whose mutations are associated with seizures, motor delays, and severe intellectual disability. Likewise,
PLP1,
NKX6-2, and
ATP1A3 mutations are linked to spasticity, ataxia, and cognitive deficits.
Pathogenic variants in genes, such as PGAP3, CHD3, and SMAD6, can impair motor and cognitive performance without producing major neurological symptoms like epilepsy, suggesting that these deficits arise from developmental or structural abnormalities rather than from direct neurodegeneration.
Mutations associated with syndromic conditions often combine neurological and cognitive impairments with dysmorphic features. Loss-of-function variants in NIPBL cause Cornelia de Lange syndrome; missense variants in ACTG1 lead to Baraitser-Winter syndrome; missense variants in ATP1A3 underlie Snijders Blok-Campeau syndrome; and loss-of-function variants in TCF4 result in Pitt-Hopkins syndrome. These syndromes usually present with both facial and skeletal anomalies, reflecting their broad developmental impacts. In contrast, loss-of-function variants in SMAD6 are primarily associated with craniosynostosis, where skeletal malformations dominate, and neurological effects are limited.
3.4. Multidimensional and Pathway-Level Analysis of Genotype–Phenotype Associations
Given the absence of demographic bias, further multidimensional analyses were performed to investigate genotype–phenotype relationships and pathway-level clustering. Using the semi-quantitative clinical scoring system, numerical scores were assigned to major phenotypic domains for each patient carrying monogenic variants (
Table 4).
This standardized approach enabled objective assessment of clinical severity across individuals and facilitated correlation and clustering analyses to delineate genotype–phenotype relationships. No consistent sex-related differences were observed across most clinical domains, indicating that disease severity and phenotypic distribution were generally comparable between males and females. Minor variability was noted in a few parameters (e.g., neurological, cognitive, and EEG domains), but none reached statistical significance after correction (
p > 0.05, Mann–Whitney U test) (
Supplementary Figure S4).
Similarly, no significant correlation between age and overall clinical severity was observed across most domains, suggesting that genotype rather than age progression predominantly determined disease expression. Mild trends were noted; for example, older patients tended to have slightly higher neurological and vision scores, whereas onset and ID scores decreased with age, potentially reflecting developmental compensation or diagnostic timing effects (
Supplementary Figure S5).
The HCA identified distinct patient subgroups with shared clinical characteristics. The left (red) cluster included individuals with early onset, severe multisystem involvement (e.g., neonatal onset, high neurodevelopmental and MRI severity scores), whereas the right (blue) cluster comprised patients with milder or later-onset phenotypes. These clusters indicate genotype–phenotype convergence and provide a framework for subsequent pathway-level and gene-based analyses (
Figure 4, left).
Distinct clinical patterns emerged among the three HCA clusters. Cluster 1 (red) exhibited elevated scores across multiple domains (e.g., onset, neurological, cognitive, MRI, and seizure), reflecting early-onset, severe multisystem disease. Cluster 2 (green) displayed intermediate severity, particularly in neurodevelopmental and cognitive domains, representing moderate phenotypes. Cluster 3 (blue) showed relatively milder and more localized abnormalities, often dominated by motor and perinatal features. These distributions demonstrate that HCA effectively stratified patients into biologically meaningful subgroups with differing phenotypic severity profiles (
Figure 4, right).
Hierarchical cluster analysis (HCA), based on Euclidean distances of standardized Z-scores across 15 clinical domains (e.g., age at onset, motor function, neurodevelopment, MRI findings, and seizures), stratified patients—not genes—into three distinct phenotypic groups (
Figure 4). Each group represented a unique constellation of disease severity, onset, and system involvement. Notably, this patient-level clustering was derived exclusively from multidimensional clinical similarity, entirely independent of genotype. Gene and pathway annotations were subsequently incorporated post hoc to assess the biological coherence of the resulting patient clusters.
Group 1—Severe multisystem neurodevelopmental disorders (Transcriptional/RNA-processing phenotype): This group included patients carrying variants in POLR1C, TCF4, HNRNPU, NIPBL, and ACTG1. These patients exhibited the highest composite Z-scores (≥+2) across onset, neurological, motor, and MRI domains, corresponding to early-infantile onset, profound motor dysfunction, extensive cortical and white-matter abnormalities, and severe intellectual disability. Functionally, these genes are involved in transcriptional regulation, chromatin remodeling, and RNA-processing mechanisms, and their disruption leads to global developmental arrest. Representative examples include POLR1C (RNA polymerase dysfunction), TCF4 (Pitt–Hopkins syndrome), and HNRNPU (epileptic encephalopathy). Thus, Group 1 represents the most severe, multisystem neurodevelopmental phenotype characterized by early developmental arrest and transcriptional dysregulation.
Group 2—Intermediate epilepsy/metabolic disorders (Ion-channel and neuronal-excitability phenotype): This group comprised patients harboring variants in SCN2A, PAH, IQSEC2, and GNPAT. These patients exhibited moderate overall severity, characterized by high seizure Z-scores (>+2), intermediate onset and neurological scores, and minimal systemic or dysmorphic involvement. The phenotype was dominated by neuronal excitability and metabolic dysfunction primarily affecting the central nervous system. Pathogenically, SCN2A variants alter voltage-gated sodium channel activity, leading to early infantile epileptic encephalopathy, whereas PAH and GNPAT variants disrupt amino acid and peroxisomal metabolism, respectively. Thus, Group 2 represents an intermediate neuro-motor severity profile driven predominantly by epileptic–metabolic mechanisms.
Group 3—Mild or focal neurodevelopmental presentations (Myelination/structural-signaling phenotype): This group comprised patients carrying variants in NKX6-2, PLP1, PGAP3, SMAD6, and ATP1A3. These patients exhibited the lowest composite Z-scores (<+1.5) and later onset, typically diagnosed after two years of age. Clinically, they presented with partial motor delay, mild intellectual disability, or leukodystrophy-like changes, occasionally accompanied by subtle visual or perinatal involvement. Functionally, these genes are implicated in myelination and intracellular signaling pathways—for example, NKX6-2 (oligodendrocyte differentiation), PLP1 (myelin formation), and SMAD6 (TGF-β signaling modulation). Thus, Group 3 represents milder or more localized neurodevelopmental disorders with limited systemic impact and slower progression.
Although the HCA was based solely on Euclidean distances of clinical Z-scores, subsequent gene annotation revealed non-random biological clustering among patient groups. Pearson’s chi-squared test confirmed a significant association between gene distribution and phenotypic groups (χ
2 = 54.566, df = 34,
p = 0.0141), indicating that genes and pathways were not randomly distributed across clusters. Consistent with this, pathway-level enrichment analyses (
Figure 5) (
Supplementary Figures S6–S8) demonstrated distinct biological signatures: Group 1, transcriptional and RNA-processing pathways; Group 2, ion-channel and metabolic pathways; and Group 3, myelination and structural/signaling pathways. Together, these findings confirm that Euclidean-based patient clustering effectively captured the underlying genotype–phenotype architecture, reinforcing the biological validity and analytical robustness of the multidimensional HCA framework.
The correlation matrix of clinical severity domains (
Figure 6) revealed a well-structured internal architecture, reflecting strong interrelationships among neurodevelopmental aspects of the phenotype. ID, neurological, motor, and mental domains were strongly correlated (
R ≈ 0.6–0.8,
p < 0.05), indicating that the degree of motor and behavioral impairments parallels cognitive dysfunction. This pattern highlights a coherent neurocognitive cluster in which central nervous system dysfunction drives both intellectual and behavioral outcomes.
A significant correlation was also found between perinatal and onset scores (R ≈ 0.6, p < 0.05), suggesting that perinatal complications may accelerate disease onset or exacerbate symptom severity. In contrast, vision and MRI domains showed weak or non-significant correlations with other traits, indicating genotype-specific or structurally localized effects rather than generalized markers of severity.
Collectively, the correlation map delineates a multidimensional yet internally structured clinical phenotype architecture. While neurodevelopmental domains exhibited strong interrelationships, sensory and imaging parameters were more variable. These findings validate the semi-quantitative clinical scoring system as a biologically meaningful tool for subsequent hierarchical clustering and multidimensional pathway integration analyses.
3.5. Copy Number Variants in Children with ID
Ten children with ID harbored CNVs. Among them, two had only duplications (CNV1, CNV2), six had only deletions (CNV3-8), and two had both duplications and deletions (CNV9, CNV10). CNV sizes ranged from 1.3 Mb to 19.1 Mb. The 6.2 Mb duplication on 16p detected in proband CNV1 was a de novo variant, whereas the 19.1 Mb duplication on 6q detected in proband CNV2 was derived from a paternal balanced chromosomal translocation t(6;20)(q25;p13). In addition, two unrelated children (CNV4, CNV7) carried a de novo deletion del(7)(q11.23), which is linked to Williams-Beuren syndrome.
Interestingly, two patients exhibited both a terminal deletion and a terminal duplication despite normal parental karyotypes and CNV sequencing results. The first case (CNV9) involved an eight-year-old girl with a 6.7 Mb terminal duplication on 10p and a 6.9 Mb terminal deletion on 18q, suggesting either a spontaneous unbalanced chromosomal translocation or a cryptic parental chromosomal rearrangement. The second case (CNV10) was a four-year-old boy with a 15 Mb duplication at 2q36.3–q37.3 and a 2.5 Mb deletion at 4q35.2. Furthermore, an eight-year-old boy (CNV8) carried a 10.2 Mb deletion on chromosome 9q and had a family history of multiple affected relatives with the same CNV, despite his parents having normal karyotypes and CNV sequencing results (
Figure 7).
Phenotypic evaluation of the CNV-positive cohort revealed that five of the ten patients displayed prenatal abnormalities detectable by ultrasound, and all exhibited dysmorphic facial features. Three patients had congenital heart defects, and three male patients presented with testicular abnormalities. Regarding cognitive outcomes, six patients had severe-to-profound ID, one had mild ID, and the remainder had moderate ID. In terms of motor development, three children were unable to walk, whereas the others experienced mild-to-moderate motor delays (
Table 5 and
Table 6).
3.6. Unconfirmed Cases
Seven cases with monogenic variants remained inconclusive. Notably, three of these patients carried biallelic variants in genes associated with autosomal recessive disorders, classified as variants of uncertain significance (VUS) but with high predicted pathogenicity according to ACMG/AMP/ClinGen SVI guidelines.
The first case (ID22) was an eight-year-old girl carrying an in-cis compound heterozygous variant in the
OSGEP gene: a paternal c.280C>T (p.Arg94Cys) and a maternal c.143A>G (p.His48Arg) variant. Both missense variants were classified as VUS. Clinically, the patient exhibited severe ID, nephrotic syndrome, loss of ambulation, and loss of speech, features consistent with the clinical features of Galloway–Mowat syndrome type 3 (GAMOS3) [
30]. Her older brother with similar clinical manifestations had passed away without undergoing genetic testing.
The second case (ID23) was a 16-year-old boy presenting with speech delay, moderate ID, normal motor development, and attention-deficit/hyperactivity disorder (ADHD). He carried an in-trans compound heterozygous variants in the HEPHL1 gene: a maternal c.2857C>T (p.Arg953Ter) and a paternal c.84G>A (p.Thr28=) variant. The c.2857C>T variant introduces a premature termination codon (PTC) and is classified as pathogenic, whereas the c.84G>A variant, located outside the splice region and not predicted to alter splicing, is classified as VUS despite occurring in trans with a pathogenic allele.
The third case (ID24) involved an 11-year-old girl with profound intellectual disability, loss of ambulation, and hypotonia. Her deceased older brother exhibited similar symptoms but had not undergone genetic testing. The patient carried a homozygous c.1066C>T variant in the
PRUNE1 gene, inherited from both parents. Biallelic pathogenic variants in
PRUNE1 have been known to cause neurodevelopmental disorder with microcephaly, hypotonia, and brain abnormalities [
31]. The identified variant, a null mutation in the last exon, is not predicted to undergo nonsense-mediated decay (NMD) but likely results in truncation of more than 10% of the transcript. Its extremely low frequency in the gnomAD population supports classification as a VUS with high pathogenic potential. Prenatal diagnostic testing performed during the mother’s subsequent pregnancy did not detect the
PRUNE1 variant, and the newborn, now one year old, exhibits normal psychomotor development.
The recurrent
MTSS2:c.1790C>T variant was found in two unrelated cases. Although previously reported as likely pathogenic [
32], its high allele frequency in East Asian populations (gnomAD genome frequency: 0.002327) suggests reduced pathogenicity. Moreover, this variant was recently listed among genes excluded from ACMG prenatal screening for genetic disorder with high allele frequency (>1%) in VN1K genome database, a population scale multiomics and phenomics resource for the Vietnamese population) [
33]. In both families, the variant was inherited from a phenotypically normal parent, supporting its reclassification as a VUS with low pathogenic potential.
Additionally, one child (ID26) carried a loss-of-function (LOF) variant in ATAD3A, associated with a known genetic disorder, but inherited from an unaffected mother. Another patient (ID25) had an LOF variant in COL4A1, inherited from an unaffected father. Due to the variable penetrance and expressivity of these genes, both variants were ultimately classified as VUS.
4. Discussion
This study highlights the genetic complexity of ID, demonstrating that both SNVs and CNVs significantly contribute to the disease spectrum. Our data revealed a male predominance, with males affected at approximately twice the frequency of females, consistent with previous reports [
34,
35]. Notably, SNV-associated ID cases were generally diagnosed later than CNV-associated cases, suggesting that early genetic screening could facilitate timely diagnosis and improve disease management. Prior international studies have similarly shown that early implementation of genomic testing—particularly exome or genome sequencing—shortens the diagnostic odyssey and enhances clinical care for patients with neurodevelopmental disorders [
3].
WES and CES were the primary tools used to identify both SNVs and CNVs in coding regions, representing a practical, dual-purpose approach in resource-limited settings [
34]. However, CNV analysis based on WES/CES data cannot reliably detect balanced structural variants, mosaicism, and small CNVs [
36], highlighting the need for complementary testing in unresolved cases such as CNV8. Importantly, several CNV-positive patients exhibited prenatal abnormalities detectable by ultrasound but remained undiagnosed before birth, underscoring the value of incorporating genomic testing into routine prenatal diagnostics [
32]. Furthermore, trio-based sequencing, which was not consistently applied in this cohort, could improve diagnostic accuracy by identifying de novo mutations—a key indicator of pathogenicity in dominant disorders [
37].
From a genetic perspective, this study uncovered recurring mutations in genes, such as NKX6-2, PAH, and SCN2A, where frameshift and nonsense mutations were associated with pronounced phenotypic effects. The high frequency of missense mutations underscores the need for functional studies to assess their pathogenic potential. Additionally, splice-site mutations, which can result in exon skipping or aberrant protein synthesis, point to diverse molecular mechanisms underlying ID. These findings emphasize the importance of integrating genetic and clinical data to refine variant interpretation and enhance diagnostic precision.
Phenotypic variability among patients carrying mutations in the same gene was also evident. For example,
SCN2A variants produced a broad clinical spectrum ranging from epilepsy to severe neurodevelopmental disorders, reflecting variable expressivity and genotype-specific functional effects [
38]. While many genes contribute to the heterogeneity of ID, isolated cases remain relatively rare.
PAH mutations, for instance, mainly result in PKU, which, if left untreated, leads to neurological impairment without significant dysmorphism or motor difficulties. This highlights the critical importance of early metabolic screening and intervention, as timely dietary and metabolic management can prevent or substantially reduce neurological deterioration [
39]. Clinically, early genetic testing is particularly valuable in patients presenting with combined neurological, motor, and intellectual impairments. The presence of dysmorphism alongside neurodevelopmental delays should prompt consideration of syndromic disorders such as Cornelia de Lange, Pitt-Hopkins, and Baraitser-Winter syndromes.
For cases with VUS, some patients exhibited biallelic variants, consistent with the suspected diagnosis and family history, yet these did not meet the current ACMG/AMP/Clingen criteria for pathogenicity [
40,
41]. In such instances, functional validation studies or segregation analysis in first-degree relatives are recommended to strengthen variant interpretation. Furthermore, for families harboring potentially deleterious variants, prenatal or preimplantation genetic testing should be considered to mitigate recurrence risk in future pregnancies [
42]. In cases where a deleterious variant was detected in both patient and a healthy parent, (e.g., patient ID25 with a heterozygous
COL4A1 variant) variable age of onset and incomplete penetrance warrant ongoing clinical surveillance and functional validation [
43].
Visual tools, such as Venn diagrams, proved valuable for integrating genetic and phenotypic data, enabling clear visualization of overlapping domains, including intellectual disability, motor deficits, and dysmorphism. Expanding this framework to incorporate additional clinical subcategories—such as severity of ID or epilepsy subtypes—could further enhance diagnostic precision and support personalized management strategies.
Collectively, these findings underscore the importance of early genetic screening, rigorous variant classification, and the use of complementary approaches, including long-read sequencing and functional assays, to resolve inconclusive cases. This study employed an integrative, multidimensional analytic framework to characterize genotype–phenotype relationships in patients with ID and ASD. By combining semi-quantitative clinical scoring, correlation mapping, hierarchical cluster analysis, and pathway-level integration, we delineated the internal structure of phenotypic heterogeneity and its molecular correlates.
Our results demonstrated that the distribution of genetic diagnostic yield was not significantly influenced by sex or age, supporting the biological consistency of the analyzed cohort. The semi-quantitative clinical scoring system, encompassing 15 domains—including developmental, neurological, structural, and systemic features—provided a harmonized and granular representation of disease burden. This approach, consistent with prior multidimensional frameworks [
19,
20,
21,
22,
23,
24,
25,
26], enabled systematic evaluation of phenotypic variation across functional systems.
The correlation heatmap revealed a coherent internal architecture within the dataset. Strong positive correlations among neurodevelopmental domains—intellectual disability, motor, mental, and neurological features—defined a unified neurocognitive functional cluster, indicating that neuronal dysfunction underlies both cognitive and motor outcomes. The observed association between perinatal complications and early disease onset underscores the modifying role of early-life biological stress on phenotypic expression. In contrast, sensory (vision) and neuroimaging (MRI) domains exhibited weak correlations with other traits, suggesting modular or genotype-specific effects rather than generalized severity.
HCA stratified patients into three major clusters, each reflecting distinct severity profiles and phenotypic compositions. Group 1 encompassed individuals with the most severe neurodevelopmental impairment and multiple coexisting systemic abnormalities. Group 2 comprised patients with moderate delays and partial functional preservation, whereas Group 3 primarily included milder cases with localized or isolated deficits. This stratification validates the semi-quantitative scoring model and demonstrates its capacity to recapitulate clinically meaningful subgroups.
Integration with pathway-level data reinforced these clinical groups. The combined heatmap of Z-scored phenotypic traits and annotated molecular pathways revealed that patients in the severe cluster were enriched for variants affecting ion channel function, myelination, metabolic processes, and synaptic signaling—mechanisms central to neuronal excitability and brain development. In contrast, milder clusters were more frequently associated with transcriptional or RNA-processing pathways. This gradient, from metabolic–synaptic to transcriptional mechanisms, mirrors the transition from early developmental to regulatory dysfunction, highlighting the biological coherence of the clustered phenotypes.
Collectively, these findings underscore the utility of multidimensional, semi-quantitative, and unsupervised methods in elucidating clinical–genetic complexity. By leveraging inter-domain correlations and hierarchical structure, these approaches enable the identification of phenotypic convergence patterns that univariate analyses may overlook. Moreover, the observed modularity supports the use of pathway-based modeling to refine variant interpretation and guide future functional studies.
Our study is consistent with recent Vietnamese and international research applying hierarchical clustering, correlation networks, and machine-learning–based dimensionality reduction to elucidate genotype–phenotype interactions in complex genetic disorders, including hepatocellular carcinoma, ischemic stroke, and viral infections. The reproducibility of these data-driven approaches across diverse biological contexts underscores their robustness and translational relevance in precision genomics.
The study sample size was limited and derived from a single center; however, the cohort included patients from diverse ethnic and geographic backgrounds across Vietnam. Variants located in non-coding regions, cryptic balanced chromosomal abnormalities, and other genomic alterations undetectable by ES or CES may have been missed. In certain cases (e.g., CNV8), the familial origin of the mutation could not be determined. Future studies should focus on expanding cohort size and incorporating patients from multiple centers. The application of whole-genome sequencing (WGS) and long-read sequencing technologies is recommended to broaden the genomic landscape assessed and enhance the detection of complex or previously undetectable variants. These strategies will ultimately improve diagnostic accuracy and enable more personalized treatment approaches for neurodevelopmental disorders.
In summary, this multidimensional framework delineates clinically coherent subgroups among ID/ASD patients and bridges the gap between genetic mechanisms and phenotypic outcomes. Future extensions incorporating trio-based sequencing, methylation profiling, and longitudinal data will further refine genotype–phenotype mapping, advancing precision diagnosis and personalized management in neurodevelopmental genetics.