1. Introduction
The non-classical major histocompatibility complex (MHC) class I chain-related protein A (MICA) and MHC class I chain-related protein B (MICB) are stress-inducible ligands for the activating receptor NKG2D, expressed on NK cells, γδ T cells, and subsets of CD8+ αβ T cells. Their upregulation on infected, transformed, or otherwise stressed cells provides a central pathway for “induced self” recognition and immune surveillance, complementing the classical “missing self” paradigm based on loss of HLA class I molecules [1,2,3]. Extensive experimental work has shown that NKG2D–MICA/MICB interactions can promote cytotoxic responses and cytokine production, but that tumor cells and viruses can evade this axis through downregulation or proteolytic shedding of MICA/B from the cell surface [4,5,6,7]. These biological properties place MICA and MICB at the interface of innate and adaptive immunity, with relevance for cancer, infection, autoimmunity, and transplantation.
Both MICA and MICB are encoded within the human MHC on chromosome 6 and display substantial allelic polymorphism, albeit with a more restricted pattern than classical HLA class I loci [8,9,10]. Population studies from different geographic regions have documented dozens of MICA and MICB alleles, often in non-random association with HLA-B and other class I loci, reflecting the shared evolutionary history of this genomic segment. For example, Brazilian renal transplant candidates and controls show marked MICA allele diversity and linkage disequilibrium (LD) between selected MICA and HLA-B alleles [8], while a Bulgarian cohort recently demonstrated 36 MICA and 16 MICB alleles with strong LD between HLA-B and both MICA and MICB [10]. High-throughput studies based on hybrid-capture next-generation sequencing (NGS) typing of donor registries and large cohorts have further expanded the catalog of observed MICA/B alleles and clarified their global frequency patterns, including documentation of alleles present in IPD-IMGT/HLA but not yet seen in millions of typed samples [9]. Despite this progress, detailed characterization of MICA/MICB diversity remains sparse for several European regions, including the Iberian Peninsula. The human MHC region is characterized by extended conserved haplotypes spanning multiple megabases, in which specific combinations of classical and non-classical loci are inherited together across generations [11]. These ancestral haplotypes contribute to non-random associations between HLA-B, MICA, and MICB alleles and may influence immune recognition beyond individual loci [12,13,14]. Studying MICA–MICB combinations within this framework therefore reflects broader genomic architecture rather than isolated gene variation.
Beyond basic population genetics, MICA and MICB have attracted attention in transplantation. Anti-MICA antibodies have been associated with increased risk of rejection and reduced graft survival in kidney transplantation, although not all studies are concordant. In a pivotal New England Journal of Medicine study, the presence of anti-MICA antibodies was associated with lower 1-year graft survival and higher incidence of acute rejection [15]. Subsequent reports and commentaries have supported a potential role for anti-MICA (and, to a lesser extent, anti-MICB) antibodies as non-HLA donor-specific antibodies contributing to allograft injury, while also noting heterogeneity between cohorts and methodologies [16,17,18]. These observations highlight the clinical importance of MICA/MICB. However, interpreting mismatches requires solid population-level data.
In allogeneic hematopoietic stem cell transplantation (HSCT), MICA and MICB are increasingly recognized as non-classical histocompatibility antigens and as key modulators of NK-cell and T-cell alloreactivity. Because NK cells and cytotoxic T cells are among the first lymphocyte subsets to reconstitute after HSCT, variation in the NKG2D–MICA/MICB axis can influence both graft-versus-leukemia and graft-versus-host responses. Several studies have shown that donor–recipient mismatches at MICA, particularly at the functional MICA-129 Met/Val dimorphism, are associated with the incidence and severity of acute and chronic graft-versus-host disease (GVHD), transplant-related mortality, and overall survival, and that matching for MICA-129 can improve outcomes in otherwise HLA-matched unrelated donor HSCT [19,20,21,22,23]. Single-nucleotide polymorphisms and mismatches in MICB, as well as variability in soluble MICA/MICB levels, have likewise been linked to post-transplant complications and GVHD in allogeneic HSCT recipients [24,25,26]. In addition, preformed anti-MICA antibodies have been reported as relevant non-HLA antibodies in the HSCT setting, especially in HLA-mismatched transplants, and may contribute to graft failure and other adverse events [27,28]. These observations indicate that MICA and MICB are not only important stress ligands in tumor and infection biology but also clinically relevant determinants of outcome in hematopoietic stem and progenitor cell transplantation.
The analytical landscape has changed substantially with the introduction of capture-based NGS typing platforms such as AlloSeq Tx17 (CareDx), which use hybrid capture rather than long-range PCR, offer full-gene coverage of classical HLA loci, and include non-classical genes such as HLA-E, HLA-F, HLA-G, HLA-H, MICA, and MICB in a single workflow [29]. This technology enables high-resolution MICA/MICB typing in routine histocompatibility laboratories and has already been used to generate comprehensive allele-frequency datasets for HLA-G, HLA-F, MICA, MICB, and related loci [30]. However, most analyses of these data still rely on classical summaries: allele frequencies, heterozygosity, pairwise LD, and simple haplotype tables, without fully exploiting the structure of multilocus genotype data.
In parallel, artificial intelligence (AI) and machine learning (ML) are increasingly applied across transplantation and immunogenetics. ML models have been used to support compatibility assessment and donor–recipient matching, to predict crossmatch compatibility, and to identify pre-transplant risk factors for de novo HLA-specific antibody development [31,32,33,34]. Recent work has demonstrated that ML-enhanced immunologic risk prediction can improve performance over conventional rules-based algorithms by integrating detailed HLA allele data, antibody profiles, and clinical covariates [31,35]. Reviews focused on kidney transplantation have highlighted the potential of AI and ML to support allocation, monitoring, and individualized immunosuppression, and have stressed the importance of integrating immunogenetic information, including non-classical loci, into such models [32]. Although NGS genotyping produces high-dimensional data, MICA/MICB diversity has rarely been explored using unsupervised ML approaches.
Most published MICA/MICB population studies have limitations. First, sample sizes are often modest and cohorts are restricted to specific patient groups or donor registries, which may not reflect the broader regional population structure [8,10]. Second, the analyses typically stop at one- or two-locus summaries, providing limited insight into the global architecture of MICA–MICB genotype constellations across individuals. Third, advanced exploratory tools, such as clustering based on high-dimensional allele presence/absence matrices, dimensionality reduction and manifold learning (principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP)), and network-based descriptions of co-occurrence, have rarely been applied to non-classical MHC loci, even though they are now standard in other areas of genomics and single-cell biology. As a result, there is still an incomplete picture of how MICA and MICB polymorphism is organized at the population level and how distinct genotype backgrounds might be recognized or targeted in clinical and research settings.
From a clinical perspective, donor selection increasingly extends beyond classical HLA matching. Non-HLA immunogenetic factors such as MICA and MICB mismatches and anti-MICA antibodies have been associated with graft rejection and graft-versus-host disease in both solid organ transplantation and HSCT. However, interpretation of these mismatches requires population-specific baseline information. Without knowledge of which MICA–MICB combinations are common, rare, or part of conserved genetic backgrounds, it is difficult to distinguish clinically meaningful incompatibilities from expected variation. Therefore, the practical goal of this study is not merely descriptive but to provide a population reference framework that allows histocompatibility laboratories to contextualize non-classical mismatches, prioritize potentially relevant combinations, and support future risk-prediction algorithms incorporating non-HLA loci.
The present study addresses this gap by analyzing a large cohort from southern Portugal, typed at high resolution for MICA and MICB using a hybrid-capture NGS workflow. The analysis combines classical population-genetic metrics (allele and genotype frequencies, heterozygosity, Hardy–Weinberg equilibrium (HWE), and LD) with unsupervised ML techniques (k-means and spectral clustering, PCA, t-SNE, UMAP, Jaccard-based similarity mapping, and network analysis) to provide an integrated view of MICA/MICB population structure. In addition to characterizing allele distributions in an Iberian population that is underrepresented in the MICA/MICB literature, the study maps recurrent MICA–MICB genotypes and identifies clusters that could inform future association studies in transplantation, infection, and cancer immunology.
2. Materials and Methods
2.1. Study Design and Anonymized Dataset
This retrospective analysis used a fully anonymized dataset of MICA and MICB genotypes obtained from routine histocompatibility testing in southern Portugal. No personal identifiers, clinical variables, dates, or other information enabling re-identification were available to the investigators, and no additional procedures or contacts with patients were involved. According to institutional policy and national regulations, studies based solely on non-identifiable data generated in routine care are exempt from formal Ethics Committee review and from the requirement for individual informed consent. The study was nevertheless conducted in line with the principles of the Declaration of Helsinki, with particular emphasis on data minimization, confidentiality, and the protection of participants’ rights and welfare.
2.2. Sample Collection and DNA Extraction
Peripheral blood was collected in EDTA tubes as part of routine histocompatibility testing. Genomic DNA was isolated in the local tissue typing laboratory using standard procedures based on either salting-out or silica-membrane column purification, following the manufacturers’ instructions. DNA quantity and purity were assessed spectrophotometrically, and samples meeting the minimal requirements for NGS library preparation were stored at −20 °C until use.
2.3. High-Resolution MICA and MICB Typing
High-resolution typing of MICA and MICB was performed as part of a capture-based NGS HLA typing workflow using the AlloSeq Tx solution (AlloSeq Tx17, CareDx, Brisbane, CA, USA). AlloSeq Tx is an NGS-based typing system that uses hybrid capture technology instead of long-range PCR, enabling full-gene coverage of classical HLA loci and several additional transplant-associated genes, including HLA-E, HLA-F, HLA-G, HLA-H, MICA, and MICB.
Library preparation followed the manufacturer’s protocol. Briefly, genomic DNA (≥50 ng per sample) was fragmented and tagged in a single-tube tagmentation step, followed by index PCR with early barcoding to minimize sample-switching risk. After bead-based normalization and pooling, libraries were enriched by hybridization with a biotinylated probe panel targeting full HLA genes and the additional non-classical loci and subsequently captured using streptavidin-coated magnetic beads. Enriched, indexed libraries were quantified, pooled, and sequenced on an Illumina NGS platform.
Allele assignment was performed using the AlloSeq Assign software (version 1.0.6.1409) and the IPD-IMGT/HLA database (version 3.57.0.0, 8 July 2024). Two high-resolution alleles at MICA and two at MICB were available for each individual. Genotypes were treated as diploid and unordered, with homozygous configurations coded as A/A and heterozygous configurations coded as A/B. Consequently, all analyses reflect diploid genotype structure, respecting the biological inheritance pattern in which one MICA allele and one MICB allele originate from each parent.
This principle was respected in all downstream calculations, including:
allele-frequency computation (2N alleles per locus);
diversity indices;
Hardy–Weinberg equilibrium tests;
multilocus heterozygosity assessment;
co-occurrence and LD estimation;
construction of combined MICA–MICB genotype configurations;
machine-learning feature encoding (presence/absence of alleles);
haplotype-level visualization of recurrent genotype configurations.
2.4. Data Curation and Coding Strategy
Raw genotype fields were harmonized prior to analysis by enforcing consistent use of asterisks and colons, trimming whitespace, and converting ambiguous or missing values to “NA”. Alleles annotated with the “Q” suffix (e.g., MICA*011:01:04Q, MICA*011:01Q) were retained as distinct specificities, following IPD-IMGT/HLA nomenclature, and were not collapsed with the corresponding non-Q alleles. The Q flag was treated as an annotation of potential sequence ambiguity but did not modify the underlying allele coding used for frequency estimates, LD calculations, or clustering and t-SNE/UMAP visualizations. For allele-level analyses, each individual contributed two allele copies per locus (2N alleles). For genotype-level analyses, the two alleles at MICA and the two alleles at MICB were retained as diploid unordered genotypes. For multilocus analyses, MICA and MICB genotypes were combined into four-allele genotype constellations.
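The harmonization and coding steps described above can be sketched as follows. This is a minimal illustration rather than the authors' actual script: the function names, the set of raw missing-value markers, and the tuple representation of genotypes are assumptions introduced here. The sketch keeps Q-suffixed alleles as distinct strings and sorts each locus so that A/B and B/A are treated as the same unordered genotype.

```python
import re

# Hypothetical set of raw markers treated as missing (an assumption for illustration).
MISSING = {"", "NA", "N/A", "-", "?"}


def normalize_allele(raw, locus):
    """Return a harmonized allele name such as 'MICA*008:01', or 'NA' if missing."""
    text = (raw or "").strip()
    if text.upper() in MISSING:
        return "NA"
    # Drop any existing locus prefix so it can be re-attached with one asterisk.
    text = re.sub(rf"^{locus}\s*\*?\s*", "", text, flags=re.IGNORECASE)
    # Remove whitespace around colons, then any remaining spaces.
    text = re.sub(r"\s*:\s*", ":", text).replace(" ", "")
    return f"{locus}*{text}"


def unordered_genotype(allele_1, allele_2):
    """Diploid genotype as a sorted tuple, so that A/B and B/A are identical."""
    return tuple(sorted((allele_1, allele_2)))


def constellation(mica_1, mica_2, micb_1, micb_2):
    """Four-allele constellation: unordered MICA genotype plus unordered MICB genotype."""
    return unordered_genotype(mica_1, mica_2) + unordered_genotype(micb_1, micb_2)


if __name__ == "__main__":
    a = normalize_allele(" mica*008 : 01 ", "MICA")
    b = normalize_allele("011:01:04Q", "MICA")  # the Q suffix is kept as-is
    print(constellation(a, b, "MICB*005:02", "MICB*002:01"))
```

Sorting within each locus, rather than across all four alleles, keeps the MICA and MICB genotypes separable when constellations are later tabulated.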
2.5. Population-Genetic Analyses
Allele frequencies were calculated from 2N allele copies per locus, where N is the number of individuals; each genotype therefore contributed two allele copies to the frequency estimates. Diversity metrics, namely Shannon entropy (H) and Pielou evenness (H/log k, where k is the number of observed alleles), were computed for each locus. Observed heterozygosity was calculated per locus as the proportion of individuals carrying two different alleles, and multilocus heterozygosity as the number of heterozygous loci (0, 1, or 2) per individual. HWE was evaluated for the most frequent alleles at each locus using a chi-square test applied to AA, Aa and aa genotype counts, with each tested allele (A) contrasted against all other alleles pooled (a) as a bi-allelic system.
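As an illustration of these single-locus metrics, the sketch below computes allele frequencies from 2N copies, Shannon entropy, Pielou evenness, observed heterozygosity, and a bi-allelic HWE chi-square test. Several details are assumptions not stated in the text: the natural logarithm is used for H, the HWE test uses one degree of freedom without continuity correction, and all function names are hypothetical.

```python
import math
from collections import Counter


def allele_frequencies(genotypes):
    """Frequencies from 2N allele copies; each genotype is a pair of allele names."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = 2 * len(genotypes)
    return {allele: n / total for allele, n in counts.items()}


def shannon_entropy(freqs):
    """Shannon entropy H (natural log assumed)."""
    return -sum(p * math.log(p) for p in freqs.values() if p > 0)


def pielou_evenness(freqs):
    """H / log k, where k is the number of observed alleles."""
    k = len(freqs)
    return shannon_entropy(freqs) / math.log(k) if k > 1 else 0.0


def observed_heterozygosity(genotypes):
    """Proportion of individuals carrying two different alleles."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


def hwe_chi_square(genotypes, allele):
    """Chi-square HWE test for one allele (A) against all others pooled (a)."""
    n = len(genotypes)
    n_AA = sum(g.count(allele) == 2 for g in genotypes)
    n_Aa = sum(g.count(allele) == 1 for g in genotypes)
    n_aa = n - n_AA - n_Aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    toy = [("A", "A"), ("A", "B"), ("A", "B"), ("B", "B")]
    freqs = allele_frequencies(toy)
    print(freqs, shannon_entropy(freqs), pielou_evenness(freqs))
    print(observed_heterozygosity(toy), hwe_chi_square(toy, "A"))
```

Multilocus heterozygosity per individual then follows by summing the per-locus heterozygote indicators for MICA and MICB (0, 1, or 2).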
2.6. Inter-Locus Association and Linkage Disequilibrium
Statistical dependence between MICA and MICB alleles was evaluated at the individual level. For selected frequent alleles, co-occurrence frequencies were calculated and visualized as a heatmap. Classical linkage disequilibrium parameters (D, D′ and r²) were derived from carrier frequencies:
p(A): carrier frequency of A
p(B): carrier frequency of B
p(AB): frequency of individuals carrying both A and B.
Allele pairs exhibiting the strongest r² values were summarized. Because multiple MICA–MICB allele pairs were evaluated, the LD statistics are reported as descriptive measures of effect size rather than as formally multiplicity-adjusted hypothesis tests. Interpretation therefore focuses on associations with large magnitude (e.g., high |D′| and r²) that are robust to multiple-testing considerations, whereas weaker signals are treated as exploratory.
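The carrier-based LD statistics can be illustrated with the sketch below. The text does not give explicit formulas, so the standard definitions of D, D′ and r² are assumed here and applied to carrier frequencies p(A), p(B) and p(AB) in place of haplotype frequencies; the function name and the representation of each individual as a set of alleles are hypothetical.

```python
def carrier_ld(individuals, allele_a, allele_b):
    """Return (D, D', r2) computed from carrier frequencies.

    `individuals` is a sequence of allele collections, one per person
    (for example, the four MICA/MICB alleles of that person).
    """
    n = len(individuals)
    p_a = sum(allele_a in person for person in individuals) / n
    p_b = sum(allele_b in person for person in individuals) / n
    p_ab = sum(allele_a in person and allele_b in person for person in individuals) / n

    d = p_ab - p_a * p_b
    # Maximum attainable |D| given the marginal frequencies (standard definition).
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    people = [
        {"MICA*009:02", "MICA*008:01", "MICB*005:06", "MICB*005:02"},
        {"MICA*009:02", "MICA*002:01", "MICB*005:06", "MICB*002:01"},
        {"MICA*008:01", "MICA*004:01", "MICB*005:02"},
        {"MICA*002:01", "MICA*004:01", "MICB*002:01", "MICB*005:02"},
    ]
    print(carrier_ld(people, "MICA*009:02", "MICB*005:06"))
```

Because carrier status in unphased diploid data is not the same as haplotype phase, values obtained this way are best read as LD-like descriptive measures.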
2.7. Combined MICA–MICB Genotype Constellations
All distinct four-allele MICA–MICB genotype constellations were enumerated. Rare constellations (carried by single individuals) were retained for frequency tabulation but only recurrent configurations (defined a priori as occurring in ≥3 individuals) were used in exploratory machine-learning visualizations and cluster maps.
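A minimal sketch of this enumeration step is given below. The input layout (one pair of MICA alleles and one pair of MICB alleles per individual) and the function name are assumptions for illustration; the threshold of three individuals follows the definition of recurrent configurations in the text.

```python
from collections import Counter


def tabulate_constellations(records, min_recurrent=3):
    """Count four-allele constellations and select the recurrent ones.

    `records` holds one ((mica_1, mica_2), (micb_1, micb_2)) entry per individual.
    Alleles are sorted within each locus so that genotypes are unordered.
    """
    counts = Counter(
        tuple(sorted(mica)) + tuple(sorted(micb)) for mica, micb in records
    )
    recurrent = {key: n for key, n in counts.items() if n >= min_recurrent}
    return counts, recurrent


if __name__ == "__main__":
    records = [
        (("MICA*004:01", "MICA*008:01"), ("MICB*005:02", "MICB*005:02")),
        (("MICA*008:01", "MICA*004:01"), ("MICB*005:02", "MICB*005:02")),
        (("MICA*004:01", "MICA*008:01"), ("MICB*005:02", "MICB*005:02")),
        (("MICA*002:01", "MICA*009:01"), ("MICB*002:01", "MICB*005:02")),
    ]
    counts, recurrent = tabulate_constellations(records)
    for key, n in counts.most_common():
        print(n, key, "recurrent" if key in recurrent else "rare")
```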
2.8. Machine-Learning Encoding and Clustering
For unsupervised exploration of allele-distribution structure, a binary presence/absence matrix was constructed across individuals. Each distinct allele represented a feature; the feature value was set to 1 when present and 0 otherwise, yielding a sparse binary feature space.
k-means clustering was applied to this matrix with Euclidean distance. The optimal number of clusters was selected by analysis of silhouette scores. As a complementary method, spectral clustering was applied to a similarity graph based on pairwise Jaccard indices.
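The encoding and model-selection steps could look roughly like the sketch below, which builds the binary presence/absence matrix and scores k-means partitions by their average silhouette. It assumes scikit-learn and NumPy are available; the function names, the number of restarts (`n_init=10`), and the fixed random seed are illustrative choices, not settings reported in the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def presence_absence_matrix(individuals):
    """Binary matrix: rows are individuals, columns are distinct alleles."""
    alleles = sorted({allele for person in individuals for allele in person})
    column = {allele: j for j, allele in enumerate(alleles)}
    X = np.zeros((len(individuals), len(alleles)), dtype=int)
    for i, person in enumerate(individuals):
        for allele in person:
            X[i, column[allele]] = 1  # presence only; homozygotes are not counted twice
    return X, alleles


def silhouette_by_k(X, k_values=range(2, 8), seed=0):
    """Average silhouette (Euclidean) for each candidate number of clusters."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels, metric="euclidean")
    return scores


if __name__ == "__main__":
    # Toy cohort with three repeated genotype backgrounds (illustrative only).
    group_1 = [("MICA*008:01", "MICA*004:01", "MICB*005:02", "MICB*005:02")] * 4
    group_2 = [("MICA*002:01", "MICA*009:01", "MICB*002:01", "MICB*002:01")] * 4
    group_3 = [("MICA*009:02", "MICA*008:04", "MICB*005:06", "MICB*004:01")] * 4
    X, alleles = presence_absence_matrix(group_1 + group_2 + group_3)
    scores = silhouette_by_k(X, k_values=(2, 3))
    print(max(scores, key=scores.get), scores)
```

In a real cohort the silhouette curve is usually much flatter than in this toy example, so the chosen k is better read as a working compromise than as evidence of discrete groups.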
2.9. Dimensionality-Reduction and Manifold Projection
To visualize multilocus structure, three complementary dimensionality-reduction methods were used, one linear (PCA) and two non-linear manifold-learning methods (t-SNE and UMAP):
Principal Component Analysis;
t-distributed Stochastic Neighbor Embedding;
Uniform Manifold Approximation and Projection.
Cluster membership, allele counts, and heterozygosity were overlaid on the low-dimensional embedding space to aid biological interpretation.
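A hedged sketch of the projection step is shown below for PCA and t-SNE using scikit-learn. The perplexity cap, the PCA initialization of t-SNE, and the random seed are assumptions chosen so that the example runs on small inputs; they are not parameters reported in the text. UMAP is indicated only in a comment, since it depends on the separate umap-learn package.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def embed_2d(X, perplexity=30.0, seed=0):
    """Project a presence/absence matrix to two dimensions with PCA and t-SNE."""
    X = np.asarray(X, dtype=float)

    pca = PCA(n_components=2, random_state=seed)
    pca_xy = pca.fit_transform(X)

    # t-SNE requires perplexity < n_samples; cap it for small inputs (assumption).
    perplexity = min(perplexity, (X.shape[0] - 1) / 3)
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    tsne_xy = tsne.fit_transform(X)

    # With umap-learn installed, an analogous projection would be, for example:
    #   import umap
    #   umap_xy = umap.UMAP(n_components=2, random_state=seed).fit_transform(X)
    return pca_xy, tsne_xy, pca.explained_variance_ratio_


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(40, 12))  # random toy matrix, not real genotypes
    pca_xy, tsne_xy, ratio = embed_2d(X)
    print(pca_xy.shape, tsne_xy.shape, ratio)
```

Cluster labels or heterozygosity counts can then be passed as the color argument of a scatter plot of either coordinate array.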
2.10. Haplotype-Level Low-Dimensional Mapping
Combined MICA–MICB genotype constellations occurring in ≥3 individuals were encoded and embedded in two dimensions using t-SNE. k-means clustering was then performed in the t-SNE space. Voronoi-style decision regions were plotted to delineate haplotypic “territories”, facilitating interpretation of recurrent genotype backgrounds.
2.11. Similarity Metrics and Hierarchical Clustering
Pairwise Jaccard similarity coefficients were computed between individuals based on allele presence/absence. The resulting similarity matrix was displayed as a heatmap. A dissimilarity matrix (1 − Jaccard) was used for agglomerative hierarchical clustering (average linkage), and the corresponding dendrogram was examined for grouping patterns.
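The similarity and hierarchical-clustering steps can be sketched with SciPy as follows. The function name is hypothetical, and the sketch assumes the binary presence/absence matrix has already been built; `pdist` with `metric="jaccard"` returns the dissimilarity (1 − Jaccard) directly, which is then passed to average-linkage clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform


def jaccard_hierarchy(X):
    """Return the Jaccard similarity matrix and an average-linkage tree."""
    X = np.asarray(X).astype(bool)
    dissimilarity = pdist(X, metric="jaccard")      # condensed 1 - Jaccard vector
    similarity = 1.0 - squareform(dissimilarity)    # square matrix, diagonal = 1
    tree = linkage(dissimilarity, method="average")  # input for a dendrogram
    return similarity, tree


if __name__ == "__main__":
    X = np.array([
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 1, 1, 1],
    ])
    similarity, tree = jaccard_hierarchy(X)
    print(np.round(similarity, 2))
    print(tree)
```

The returned tree can be drawn with `scipy.cluster.hierarchy.dendrogram`, and the similarity matrix can be shown as a heatmap with matplotlib or seaborn.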
2.12. Software Environment and Reproducibility
All analyses were performed in Python (version 3.13), executed in a reproducible scripted environment. The following major libraries were used:
pandas (data structuring and manipulation)
NumPy (vectorized numeric operations)
SciPy (statistical functions)
scikit-learn (k-means, spectral clustering, PCA, t-SNE)
umap-learn (UMAP projections)
matplotlib/seaborn (visualization)
Scripts were version-controlled and executed without manual intervention to guarantee reproducibility. Intermediate and final outputs consisted of machine-readable CSV tables and high-resolution figures.
3. Results and Discussion
The final dataset comprised 1364 unrelated individuals from southern Portugal typed at high resolution for MICA and MICB using a capture-based NGS workflow. Forty-two distinct MICA alleles and twenty-two MICB alleles were identified (Table 1). At the MICA locus, the most frequent alleles were MICA*002:01 and MICA*008:01, each accounting for approximately 16.9% of all observed MICA allele copies in the cohort, followed by MICA*004:01 (14.3%), MICA*008:04 (10.0%) and MICA*009:01 (8.3%). At MICB, MICB*005:02 predominated and represented 41.2% of all MICB allele copies, with MICB*002:01 (20.9%), MICB*004:01 (15.2%), MICB*008:01 (9.2%) and MICB*005:03 (3.8%) forming a second tier of common variants. Figure 1A,B shows the dominance of a few alleles at each locus.
These spectra are broadly concordant with reports from other European cohorts typed by NGS, where alleles of the MICA*002, MICA*004, MICA*008 and MICB*005, MICB*004, MICB*002 groups also dominate, although the exact rank order and contribution of low-frequency alleles vary across regions. In a German donor registry of more than two million individuals, MICA*008, MICA*002 and MICA*009, and MICB*005, MICB*004 and MICB*002 were the most frequent alleles, mirroring the pattern observed here but with substantially higher frequencies for MICA*008 and MICB*005 in that population [9]. A recent Finnish analysis likewise confirmed extensive MICA and MICB polymorphism with notable regional frequency differences within Europe [36]. The present data extend this picture by providing a detailed allele-frequency reference for a southern Portuguese population. The Iberian area has been underrepresented in non-classical MHC class I population studies, with only small studies reported to date [30], so this reference may be directly useful for histocompatibility laboratories in the region.
Diversity metrics reinforced the impression that MICA is more polymorphic than MICB in this cohort (Table 2). Shannon entropy was higher for MICA (H ≈ 2.7) than for MICB (H ≈ 1.8), and the expected heterozygosity (He) likewise indicated a broader and more even distribution of MICA alleles. This pattern is consistent with prior observations that MICA exhibits particularly rich allelic diversity in Europe and South America [8,9], whereas MICB, although clearly polymorphic, tends to show a more concentrated spectrum dominated by a few common alleles. From an evolutionary perspective, such a contrast is compatible with a combination of balancing selection on ligand–receptor interactions, hitch-hiking with classical HLA haplotypes, and demographic processes shaping the non-classical MHC region. Recent large-scale imputation work has also highlighted that MICA, MICB, HLA-E, HLA-F and HLA-G carry substantial information content beyond classical HLA loci and can now be inferred accurately in biobank-scale datasets, further underscoring their relevance for population and disease studies [37].
Observed heterozygosity was high at both loci, but again higher for MICA (≈0.90) than for MICB (≈0.74) (Table 2). At the individual level, multilocus heterozygosity exemplified the richness of non-classical MHC variation in this population: 932 individuals (68.3%) were heterozygous at both loci, 371 (27.2%) were heterozygous at one locus only, and only 61 (4.5%) were homozygous at both MICA and MICB (Figure 2). The same information is summarized numerically in Table 3. Thus, most individuals carry two distinct MICA allotypes and two distinct MICB allotypes, creating considerable diversity in the repertoire of NKG2D ligands at the cell surface. This multilocus diversity is biologically plausible for stress-inducible ligands involved in immune surveillance and may be particularly relevant in settings where NK cells and cytotoxic T cells play central roles, such as viral infection, tumor immunology and transplantation. In allogeneic HSCT, where NK cells and CD8+ T cells reconstitute early and NKG2D engagement can influence both graft-versus-leukemia and graft-versus-host disease (GVHD), the presence of multiple MICA and MICB allotypes per individual is of special interest because it broadens the possible ligand landscape experienced by donor-derived effector cells [21,25,26,38].
HWE analyses for the most frequent alleles did not reveal strong deviations from equilibrium expectations. For example, the genotype distributions of MICA*002:01 and MICB*005:02 were compatible with HWE under a panmictic population model, arguing against gross genotyping artifacts or very strong selection acting on single alleles in this cohort (Table 4). The absence of marked HWE departures at the allele level does not exclude more subtle selection on extended haplotypes, cis-regulatory variants or amino-acid motifs, but it provides reassurance regarding the technical robustness and population coherence of the dataset. This is important because artifacts in NGS-based typing of non-classical loci remain a concern in some settings; recent work has emphasized the need for careful assay validation and curated reference panels for MICA and MICB.
The relationship between MICA and MICB polymorphism was further examined through allele co-occurrence patterns and LD-like statistics. A heatmap summarizing the co-occurrence of the fifteen most frequent alleles at each locus (Figure 3 and Supplementary Table S1) revealed several non-random combinations at the individual level.
LD metrics showed strong associations for several MICA–MICB pairs, including MICA*009:02–MICB*005:06 (D′ ≈ 0.94, r² ≈ 0.69) and MICA*016:01–MICB*005:01 (r² ≈ 0.42), with weaker associations for MICA*008:04–MICB*004:01 (r² ≈ 0.16) and MICA*008:01–MICB*008:01 (r² ≈ 0.12) (Figure 4). The higher r² values indicate conserved haplotypic backgrounds in which specific MICA and MICB alleles are inherited together much more frequently than expected under random assortment. Similar non-random associations between MICA, MICB and HLA-B have been documented in Brazilian, Bulgarian and Finnish cohorts and are thought to reflect the shared evolutionary history of this segment of chromosome 6 [8,10,36]. From a practical perspective, such conserved haplotypes are useful for imputation and can inform interpretation of non-classical mismatches in donors typed only at classical HLA loci. Such strong MICA–MICB associations are also consistent with the broader architecture of conserved extended haplotypes in the MHC, in which classical and non-classical loci can be co-inherited across megabase-length segments [11].
At the genotype level, 606 distinct four-allele MICA–MICB genotype constellations (unordered MICA genotype plus unordered MICB genotype) were observed among the 1364 individuals. As expected for highly polymorphic loci, the distribution was long-tailed: many constellations occurred only once or twice, whereas a limited subset of recurrent backgrounds accounted for a substantial fraction of all genotypes. The two most frequent constellations were MICA*004:01/MICA*008:04 with MICB*004:01/MICB*005:02 and MICA*004:01/MICA*008:01 with MICB*005:02/MICB*005:02, each present in seventeen individuals. Several other constellations combining MICA*002:01, MICA*004:01 or MICA*008:01 with MICB*002:01 and/or MICB*005:02 were also prominent. These recurrent genotype backbones represent the dominant NKG2D-ligand contexts expected in donors and recipients from this region and form a natural reference for future association studies in solid organ transplantation, HSCT and infection.
To look beyond single-locus frequency summaries, individual genotypes were encoded as a binary presence/absence matrix across all observed MICA and MICB alleles, and unsupervised machine-learning methods were applied to this high-dimensional space. K-means clustering was performed with k between 2 and 7, and the average silhouette coefficient, a standard measure of cluster separation, suggested k = 3 as a reasonable compromise between separation and interpretability (average silhouette ≈ 0.14; Figure 5). Numeric silhouette scores for k values from 2 to 7 are provided in Supplementary Table S5. The silhouette value is modest, but this is typical for complex genetic datasets with overlapping groups, and it indicates that the data are better described as a continuous structure with areas of higher density than as sharply separated subpopulations.
The final k-means partition produced three clusters of comparable size (430, 498 and 436 individuals). Each cluster displayed a characteristic, although overlapping, allelic signature, with several common alleles (for example MICA*008:01 and MICB*005:02) contributing substantially to more than one cluster (Table 5). One cluster was enriched for MICA*008:01, MICA*004:01 and MICA*002:01 together with MICB*005:02; a second was characterized by MICA*002:01 and MICA*009:01 in combination with a high frequency of MICB*002:01; and a third cluster showed increased frequencies of MICA*008:01, MICA*008:04, MICA*004:01 and MICA*009:02, accompanied by an MICB profile enriched in MICB*004:01, MICB*008:01, MICB*003:01, MICB*005:06 and MICB*005:03. These clusters likely represent overlapping genetic backgrounds rather than discrete subgroups, consistent with the complex LD structure of the MHC.
Dimensionality-reduction techniques provided complementary views of this structure. PCA positioned individuals in a two-dimensional space where the three k-means clusters occupied partially distinct, but overlapping, regions along the first two principal components (Figure 6). Non-linear methods, namely t-SNE and UMAP, emphasized local neighborhood relationships and revealed more clearly delineated “clouds” corresponding to each cluster, with the cluster enriched for genotypes carrying MICA*009:02 and MICB*005:06 tending to occupy a more peripheral position in the embedding space (Figure 7 and Figure 8). These manifold-learning maps visualize the MICA/MICB genotype landscape, revealing regions of higher density and gradations between common haplotypic backgrounds. Similar visual strategies have been applied successfully to other high-dimensional immunogenetic datasets, including single-cell transcriptomic data and T-cell receptor repertoires, to reveal structure that is not obvious in raw categorical data.
As a sensitivity analysis, spectral clustering based on a nearest-neighbor Jaccard similarity graph was also applied. This method identified one large group of 1344 individuals and two very small clusters of ten individuals each. Exact cluster sizes for both k-means and spectral clustering are summarized in Supplementary Table S6. The very small cluster sizes suggest that these sets most likely correspond to rare or unusual haplotypic configurations at the edges of the distribution, rather than to biologically distinct subpopulations. Given their exploratory nature and limited interpretability, the detailed spectral clustering results are presented as Supplementary Material, while the k-means-based partition and the associated PCA/t-SNE/UMAP maps remain the focus of the main text.
To examine similarity among individuals in more detail, pairwise Jaccard similarity was computed from the allele presence/absence matrix. The resulting similarity heatmap and associated hierarchical clustering (Figure 9, Figure 10 and Figure 11) revealed blocks of individuals sharing highly similar MICA/MICB repertoires, broadly corresponding to the k-means clusters but also containing smaller subgroups driven by particular alleles or allele combinations. No isolated clusters of individuals were apparent, again arguing against strong hidden substructure within the cohort. In parallel, an allele–allele Jaccard similarity heatmap (Figure 12) grouped MICA and MICB alleles that tend to co-occur more frequently than expected by chance. The corresponding edge list for the allele co-occurrence network, which is particularly useful for network-based visualization in software such as Cytoscape (version 3.10.3), is provided as Supplementary Table S3.
From a clinical perspective, clustering and similarity metrics are not intended to define biological subpopulations but to quantify immunogenetic proximity between individuals. In transplantation, compatibility is rarely binary; rather, recipients may differ from donors by common or unusual genetic backgrounds. Clustering and Jaccard similarity provide a framework to measure whether a donor–recipient pair shares a typical regional MICA–MICB background or represents an uncommon combination potentially associated with increased alloreactivity [33]. These approaches may therefore support future allocation algorithms or risk-stratified analyses rather than immediate clinical decision rules.
Recurrent four-allele MICA–MICB genotype constellations observed in at least three individuals were further examined at the haplotype-constellation level using t-SNE (
Figure 13). In this analysis, each point corresponded to a distinct MICA/MICB genotype constellation, and distances in the two-dimensional map reflected similarity in the underlying allele sets. The central portion of the map was populated by frequent constellations built around MICA*002:01, MICA*004:01, MICA*008:01, MICB*005:02 and MICB*002:01, whereas more peripheral regions were occupied by constellations containing rarer alleles such as MICA*009:02 and MICB*005:06. K-means clustering in the t-SNE plane identified several haplotype-constellation clusters (
Supplementary Table S4), and decision-region plots helped visualize how these clusters tile the genotype space. Conceptually, this representation offers a compact map of the non-classical MHC genotype landscape: donor–recipient pairs lying within the same region of the map are expected to share closely related MICA/MICB backgrounds, whereas pairs located in different regions are more likely to differ qualitatively in NKG2D-ligand context.
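The preprocessing implied by this analysis can be sketched as follows, under stated assumptions: constellations are keyed by the order-independent four-allele tuple, retained when seen in at least three individuals, and encoded as binary allele-presence vectors that could then be passed to an embedding method such as t-SNE. The cohort is hypothetical, the function names are illustrative, and the encoding actually used in the study is not specified here.

```python
from collections import Counter

def recurrent_constellations(cohort, min_count=3):
    """Count four-allele constellations (order-independent) and keep recurrent ones."""
    counts = Counter(tuple(sorted(individual)) for individual in cohort)
    return {c: n for c, n in counts.items() if n >= min_count}

def multi_hot(constellations):
    """Encode each constellation as a 0/1 vector over the alleles it may contain.

    Assumption: presence/absence encoding, so zygosity information is lost.
    """
    alleles = sorted({a for c in constellations for a in c})
    vectors = [[1 if a in c else 0 for a in alleles] for c in constellations]
    return alleles, vectors

# Hypothetical cohort (NOT study data): (MICA, MICA, MICB, MICB) per individual.
common = ("MICA*002:01", "MICA*008:01", "MICB*002:01", "MICB*005:02")
shuffled = ("MICA*008:01", "MICA*002:01", "MICB*005:02", "MICB*002:01")
homozygous = ("MICA*004:01", "MICA*004:01", "MICB*002:01", "MICB*002:01")
rare = ("MICA*008:01", "MICA*009:02", "MICB*005:02", "MICB*005:06")
cohort = [common, shuffled, common, homozygous, homozygous, homozygous, rare]

recurrent = recurrent_constellations(cohort)
alleles, vectors = multi_hot(sorted(recurrent))
print(alleles)
print(vectors)
```

The rows of `vectors` are the points that an embedding would place in the two-dimensional map; the singleton constellation containing MICA*009:02 is excluded by the recurrence filter.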
A complementary view of genotype organization was obtained by tabulating and visualizing the most common MICA and MICB genotypes in a two-dimensional matrix (
Figure 14). This matrix highlighted a limited set of highly recurrent genotype combinations, mostly involving MICA*002:01, MICA*004:01, MICA*008:01 and MICB*002:01, MICB*004:01, MICB*005:02, which accounted for a large proportion of the cohort. Rare genotype combinations tended to appear in sparsely populated rows and columns, echoing their peripheral position in the t-SNE map. This dual representation (clustered matrix and manifold embedding) makes it easier to identify genotype backgrounds that may deserve specific attention in future association analyses.
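A genotype matrix of this kind can be tabulated with a simple cross-count, shown here as a minimal sketch. The individuals are hypothetical, the function name `genotype_matrix` is an assumption, and rows and columns are ordered alphabetically rather than by frequency or clustering.

```python
from collections import Counter

# Hypothetical individuals (NOT study data): two MICA and two MICB alleles each.
cohort = [
    ("MICA*002:01", "MICA*008:01", "MICB*002:01", "MICB*005:02"),
    ("MICA*008:01", "MICA*002:01", "MICB*005:02", "MICB*002:01"),  # same genotype, other order
    ("MICA*004:01", "MICA*008:01", "MICB*002:01", "MICB*004:01"),
    ("MICA*002:01", "MICA*008:01", "MICB*002:01", "MICB*004:01"),
]

def genotype_matrix(cohort):
    """Cross-tabulate MICA genotypes (rows) against MICB genotypes (columns)."""
    counts = Counter()
    for a1, a2, b1, b2 in cohort:
        # Sort allele pairs so that A/B and B/A count as the same genotype.
        mica = "/".join(sorted((a1, a2)))
        micb = "/".join(sorted((b1, b2)))
        counts[(mica, micb)] += 1
    rows = sorted({mica for mica, _ in counts})
    cols = sorted({micb for _, micb in counts})
    matrix = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix

rows, cols, matrix = genotype_matrix(cohort)
for label, row in zip(rows, matrix):
    print(label, row)
```

Cells with a count of zero correspond to genotype combinations never observed together, which is where rare backgrounds would show up as sparsely populated rows or columns.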
These findings have implications for both solid organ transplantation and allogeneic HSCT. In kidney transplantation, anti-MICA antibodies, often targeting epitopes present on common alleles such as MICA*008, have been associated in several cohorts with acute rejection, chronic allograft dysfunction and inferior graft survival, even after adjustment for classical HLA donor-specific antibodies, although not all studies have been consistent. The landmark New England Journal of Medicine study by Zou and colleagues reported lower one-year graft survival and higher rejection rates in recipients with anti-MICA antibodies. More recent work continues to support a role for anti-MICA, and potentially anti-MICB, as non-HLA donor-specific antibodies contributing to allograft injury, particularly when combined with HLA donor-specific antibodies. Knowledge of which MICA and MICB alleles and combined genotypes dominate in a given population is crucial for interpreting non-HLA antibody profiles, designing bead panels that adequately cover local polymorphism, and quantifying the incremental predictive value of anti-MICA/MICB antibodies in graft-outcome models [
15].
In HSCT and hematopoietic progenitor transplantation, MICA and MICB have emerged as non-classical histocompatibility determinants acting through the NKG2D axis. Mismatches at MICA, particularly involving the functional MICA-129 Met/Val dimorphism that modulates NKG2D signaling strength, have been associated with acute and chronic GVHD, relapse and survival in HLA-matched unrelated donor HSCT. Matching for MICA-129 has been reported to improve outcomes in otherwise 10/10 HLA-matched transplants, and recent data from large European cohorts and meta-analyses confirm that MICA-129 mismatches can act as independent risk factors for inferior disease-free survival. MICB mismatches have also begun to receive attention, with emerging evidence that, in 9/10 HLA-matched settings, MICB mismatches may worsen disease-free and GVHD- and relapse-free survival. The present population map does not incorporate outcome data, but it provides a detailed haplotypic and multilocus framework against which future HSCT studies in this region can be interpreted. In particular, it identifies the dominant MICA–MICB genotype constellations, quantifies the frequency of alleles known or suspected to be functionally important (including those encoding MICA-129 variants), and shows how these constellations cluster in genotype space. This information can support study designs that examine whether specific non-classical haplotypic backgrounds or cluster membership are associated with GVHD, relapse, infection or transplant-related mortality [
19,
20,
22].
Combining classical population genetics with unsupervised machine learning reveals structure often missed by standard descriptive statistics. The clustering, similarity mapping and t-SNE/UMAP visualizations do not replace hypothesis-driven association tests, but they provide a global view of the MICA/MICB genotype landscape, highlight features such as conserved haplotypes and dominant backgrounds, and suggest biologically plausible partitions that may be useful for stratification in future clinical or functional analyses. Comparable computational pipelines are increasingly used to integrate HLA data, antibody profiles and clinical covariates into ML-based risk prediction models for solid organ and HSCT outcomes, and the present work shows how these concepts can be extended to non-classical MHC class I ligands.
Several limitations should be acknowledged. First, the cohort originates from a single histocompatibility laboratory in southern Portugal and, although numerically robust, may not fully capture the diversity of the broader Portuguese or Iberian populations. Explicit ancestry information and replication in additional centers would strengthen generalizability. Second, the dataset was intentionally restricted to fully anonymized genotypes without demographic, clinical or outcome data, so the analyses remain descriptive and cannot address associations with rejection, GVHD, relapse, infection or survival. Third, unsupervised algorithms are sensitive to parameter choices and to the curse of dimensionality; the moderate silhouette scores and reliance on two-dimensional embeddings for visualization mean that clusters and maps should be interpreted as heuristic summaries rather than definitive subpopulation boundaries. Fourth, although the typing platform also generates classical HLA data, phase-resolved extended MHC haplotypes could not be reconstructed because the dataset was anonymized and lacked family or segregation information. Therefore, the present analysis focuses on allele co-occurrence rather than formally defined ancestral haplotypes. In future multi-center datasets with appropriate phasing strategies (e.g., family-based data or statistical phasing), extended MHC haplotypes integrating classical HLA loci and non-classical genes from the same platform could be reconstructed, enabling direct testing of clinical associations. In addition, only MICA and MICB were interrogated in depth, even though the underlying NGS platform also provides high-resolution data for HLA-E, HLA-F, HLA-G and other loci that participate in NK-cell and T-cell regulation, and emerging data suggest that structural variants such as MICA deletions may introduce additional layers of complexity.
From a real-world perspective, the present dataset is best used to complement, rather than replace, classical HLA matching. Its most immediate value is to support targeted, case-based use of MICA/MICB typing and non-HLA antibody interpretation and to guide the development and validation of more detailed histocompatibility assays. In practice, three workflows are particularly realistic:
(i) Non-HLA antibody workup in solid organ transplantation: when anti-MICA and/or anti-MICB reactivity is detected (or suspected in the setting of antibody-mediated injury not fully explained by HLA-DSA), laboratories can use local allele and genotype frequency information to interpret whether the implicated specificities represent common regional allotypes versus rare variants and to prioritize confirmatory testing and panel coverage accordingly. When donor typing is available, this enables a structured “non-HLA virtual crossmatch” by determining whether the donor expresses the implicated allotype/epitope and by documenting the mismatch context for multidisciplinary risk discussions and post-transplant monitoring strategies;
(ii) Selected HSCT donor evaluation when donors are otherwise comparable by classical HLA: in centers that already consider non-classical determinants, the recurrent multilocus constellations and similarity structure provide a practical way to judge whether a candidate donor shares a typical regional MICA/MICB background with the recipient or represents a less common constellation that may warrant closer evaluation in outcome-linked studies;
(iii) Diagnostic development and quality assurance: for histocompatibility laboratories and diagnostic manufacturers, the restricted set of common alleles and recurrent genotype backbones offers an evidence-based basis to assess population coverage of bead content, design validation panels, and prioritize which specificities are essential to include for Iberian recipients. Because commercial diagnostics serve international populations, we emphasize that these frequencies should be interpreted as a southern Portuguese/Iberian reference and should be complemented by analogous datasets from other populations for worldwide assay design; however, the analytical framework presented here is directly transferable to such multi-population efforts.
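Workflows (i) and (iii) can be sketched in a few lines, under strong simplifying assumptions: antibody specificities are matched to donor typing by allele name only (real assays resolve antigens or epitopes, and bead reactivity thresholds are ignored), and panel coverage is defined as the share of individuals whose alleles all appear on the panel. All typings, the panel content and both function names are hypothetical.

```python
def virtual_crossmatch(antibody_specificities, donor_alleles, recipient_alleles):
    """Donor alleles targeted by the recipient's antibodies and absent from the recipient."""
    targeted = set(antibody_specificities) & set(donor_alleles)
    return sorted(targeted - set(recipient_alleles))

def panel_coverage(panel, cohort):
    """Fraction of individuals whose alleles are all represented on the panel."""
    panel = set(panel)
    covered = sum(1 for alleles in cohort if set(alleles) <= panel)
    return covered / len(cohort)

# Hypothetical typings and antibody profile (NOT study data).
donor = {"MICA*002:01", "MICA*008:01", "MICB*002:01", "MICB*005:02"}
recipient = {"MICA*002:01", "MICA*009:02", "MICB*002:01", "MICB*005:02"}
antibodies = {"MICA*004:01", "MICA*008:01"}

mismatched_targets = virtual_crossmatch(antibodies, donor, recipient)
print(mismatched_targets)

# Hypothetical bead panel restricted to common alleles.
panel = {"MICA*002:01", "MICA*004:01", "MICA*008:01",
         "MICB*002:01", "MICB*004:01", "MICB*005:02"}
third = {"MICA*004:01", "MICA*008:01", "MICB*002:01", "MICB*004:01"}
coverage = panel_coverage(panel, [donor, recipient, third])
print(round(coverage, 2))
```

In this toy example the recipient is left uncovered because MICA*009:02 is missing from the panel, which is the kind of gap that local frequency data would help prioritize.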
Despite these limitations, our results provide a detailed view of MICA and MICB diversity in a Southern European population. Together, allele-frequency analysis, diversity metrics, LD mapping, multilocus genotype enumeration, individual-level similarity structure and haplotype-level manifold representations characterize the non-classical MHC class I ligand landscape of this cohort. This framework can assist histocompatibility laboratories in interpreting non-HLA antibodies and in identifying clinical scenarios where MICA/MICB typing is most actionable (e.g., anti-MICA/anti-MICB sensitization or selected HSCT settings), and it can support future integrated risk models that incorporate non-classical MHC determinants alongside classical HLA and clinical covariates.