Integrative Population Analysis of MICA and MICB Using Unsupervised Machine Learning in a Large Histocompatibility Laboratory Cohort

Ramalhete, Luis; Almeida, Paula; Araújo, Ruben; Espada, Eduardo

doi:10.3390/j9010008


by Luis Ramalhete ¹,²,³,*, Paula Almeida ¹, Ruben Araújo ² and Eduardo Espada ¹,⁴

¹ Blood and Transplantation Center of Lisbon, Instituto Português do Sangue e da Transplantação, Alameda das Linhas de Torres, nº 117, 1769-001 Lisboa, Portugal

² NOVA Medical School, Faculdade de Ciências Médicas, Universidade NOVA de Lisboa, 1169-056 Lisbon, Portugal

³ iNOVA4Health—Advancing Precision Medicine, Núcleo de Investigação em Doenças Renais, NOVA Medical School, Faculdade de Ciências Médicas, Universidade NOVA de Lisboa, 1169-056 Lisbon, Portugal

⁴ Serviço de Hematologia, ULS Santa Maria, Hospital de Santa Maria, Av Prof. Egas Moniz, 1649-028 Lisboa, Portugal

* Author to whom correspondence should be addressed.

J 2026, 9(1), 8; https://doi.org/10.3390/j9010008

Submission received: 15 January 2026 / Revised: 19 February 2026 / Accepted: 3 March 2026 / Published: 6 March 2026

(This article belongs to the Special Issue Feature Papers of J—Multidisciplinary Scientific Journal in 2026)


Abstract

Background: Non-classical MHC class I molecules MICA and MICB are stress-inducible NKG2D ligands that contribute to immune surveillance, non-HLA antibody formation, and alloreactivity in solid organ and hematopoietic stem cell transplantation; population-level data for Southern Europe remain limited. Methods: High-resolution MICA and MICB genotyping was performed in 1364 unrelated individuals from southern Portugal using a hybrid-capture next-generation sequencing workflow, and allele calls were analyzed with standard population-genetic metrics (allele and genotype frequencies, heterozygosity, Hardy–Weinberg equilibrium, and LD-like D, D′, r²) and multilocus allele presence/absence encodings explored by k-means clustering, spectral clustering, principal component analysis, t-distributed stochastic neighbor embedding, and uniform manifold approximation and projection. Results: Forty-two MICA and twenty-two MICB alleles were identified; MICA*002:01, MICA*004:01, MICA*008:01, MICA*008:04 and MICB*002:01, MICB*004:01, MICB*005:02, MICB*008:01 were most frequent, and most individuals carried at least two distinct MICA and two distinct MICB allotypes. Co-occurrence and LD-like analyses revealed conserved MICA–MICB combinations, including a strong association between MICA*009:02 and MICB*005:06, while unsupervised analyses identified partially overlapping multilocus genotype backgrounds and recurrent four-allele constellations. Conclusions: These findings provide a detailed non-classical MHC reference for southern Portugal and a multilocus framework to support interpretation of non-HLA antibodies and MICA/MICB-aware donor evaluation in selected clinical scenarios, as well as the development of machine learning-based immunologic risk models.

Keywords:

MICA; MICB; NKG2D ligands; histocompatibility; population genetics; linkage disequilibrium; unsupervised machine learning; t-SNE; UMAP; clustering

1. Introduction

The non-classical major histocompatibility complex (MHC) class I chain-related protein A (MICA) and MHC class I chain-related protein B (MICB) are stress-inducible ligands for the activating receptor NKG2D, expressed on NK cells, γδ T cells, and subsets of CD8⁺ αβ T cells. Their upregulation on infected, transformed, or otherwise stressed cells provides a central pathway for “induced self” recognition and immune surveillance, complementing the classical “missing self” paradigm based on loss of HLA class I molecules [1,2,3]. Extensive experimental work has shown that NKG2D–MICA/MICB interactions can promote cytotoxic responses and cytokine production, but that tumor cells and viruses can evade this axis through downregulation or proteolytic shedding of MICA/B from the cell surface [4,5,6,7]. These biological properties place MICA and MICB at the interface of innate and adaptive immunity, with relevance for cancer, infection, autoimmunity, and transplantation.

Both MICA and MICB are encoded within the human MHC on chromosome 6 and display substantial allelic polymorphism, albeit with a more restricted pattern than classical HLA class I loci [8,9,10]. Population studies from different geographic regions have documented dozens of MICA and MICB alleles, often in non-random association with HLA-B and other class I loci, reflecting the shared evolutionary history of this genomic segment. For example, Brazilian renal transplant candidates and controls show marked MICA allele diversity and linkage disequilibrium (LD) between selected MICA and HLA-B alleles [8], while a Bulgarian cohort recently demonstrated 36 MICA and 16 MICB alleles with strong LD between HLA-B and both MICA and MICB [10]. High-throughput studies based on hybrid-capture next-generation sequencing (NGS) typing of donor registries and large cohorts have further expanded the catalog of observed MICA/B alleles and clarified their global frequency patterns, including documentation of alleles present in IPD-IMGT/HLA but not yet seen in millions of typed samples [9]. Despite this progress, detailed characterization of MICA/MICB diversity remains sparse for several European regions, including the Iberian Peninsula. The human MHC region is characterized by extended conserved haplotypes spanning multiple megabases, in which specific combinations of classical and non-classical loci are inherited together across generations [11]. These ancestral haplotypes contribute to non-random associations between HLA-B, MICA, and MICB alleles and may influence immune recognition beyond individual loci [12,13,14]. Studying MICA–MICB combinations within this framework therefore reflects broader genomic architecture rather than isolated gene variation.

Beyond basic population genetics, MICA and MICB have attracted attention in transplantation. Anti-MICA antibodies have been associated with increased risk of rejection and reduced graft survival in kidney transplantation, although not all studies are concordant. In a pivotal New England Journal of Medicine study, the presence of anti-MICA antibodies was associated with lower 1-year graft survival and higher incidence of acute rejection [15]. Subsequent reports and commentaries have confirmed a potential role for anti-MICA (and, to a lesser extent, anti-MICB) as non-HLA donor-specific antibodies contributing to allograft injury, while also highlighting heterogeneity between cohorts and methodologies [16,17,18]. These observations highlight the clinical importance of MICA/MICB. However, interpreting mismatches requires solid population-level data.

In allogeneic hematopoietic stem cell transplantation (HSCT), MICA and MICB are increasingly recognized as non-classical histocompatibility antigens and as key modulators of NK-cell and T-cell alloreactivity. Because NK cells and cytotoxic T cells are among the first lymphocyte subsets to reconstitute after HSCT, variation in the NKG2D–MICA/MICB axis can influence both graft-versus-leukemia and graft-versus-host responses. Several studies have shown that donor–recipient mismatches at MICA, particularly at the functional MICA-129 Met/Val dimorphism, are associated with the incidence and severity of acute and chronic graft-versus-host disease (GVHD), transplant-related mortality, and overall survival, and that matching for MICA-129 can improve outcomes in otherwise HLA-matched unrelated donor HSCT [19,20,21,22,23]. Single-nucleotide polymorphisms and mismatches in MICB, as well as variability in soluble MICA/MICB levels, have likewise been linked to post-transplant complications and GVHD in allogeneic HSCT recipients [24,25,26]. In addition, preformed anti-MICA antibodies have been reported as relevant non-HLA antibodies in the HSCT setting, especially in HLA-mismatched transplants, and may contribute to graft failure and other adverse events [27,28]. These observations indicate that MICA and MICB are not only important stress ligands in tumor and infection biology but also clinically relevant determinants of outcome in hematopoietic stem cell and progenitor transplantation.

The analytical landscape has changed substantially with the introduction of capture-based NGS typing platforms such as AlloSeq Tx17 (CareDx), which use hybrid capture rather than long-range PCR, offer full-gene coverage of classical HLA loci, and include non-classical genes such as HLA-E, HLA-F, HLA-G, HLA-H, MICA, and MICB in a single workflow [29]. This technology enables high-resolution MICA/MICB typing in routine histocompatibility laboratories and has already been used to generate comprehensive allele-frequency datasets for HLA-G, HLA-F, MICA, MICB, and related loci [30]. However, most analyses of these data still rely on classical summaries: allele frequencies, heterozygosity, pairwise LD, and simple haplotype tables, without fully exploiting the structure of multilocus genotype data.

In parallel, artificial intelligence (AI) and machine learning (ML) are increasingly applied across transplantation and immunogenetics. ML models have been used to support compatibility assessment and donor/recipient matching, to predict crossmatch compatibility, and to identify pre-transplant risk factors for de novo HLA-specific antibody development [31,32,33,34]. Recent work has demonstrated that ML-enhanced immunologic risk prediction can improve performance over conventional rules-based algorithms by integrating detailed HLA allele data, antibody profiles, and clinical covariates [31,35]. Reviews focused on kidney transplantation have highlighted the potential of AI and ML to support allocation, monitoring, and individualized immunosuppression, and have stressed the importance of integrating immunogenetic information, including non-classical loci, into such models [32]. Although NGS genotyping produces high-dimensional data, MICA/MICB diversity has rarely been explored using unsupervised ML approaches.

Most published MICA/MICB population studies have limitations. First, sample sizes are often modest and focused on specific patient groups or donor registries, which may not reflect the broader regional population structure [8,10]. Second, the analyses typically stop at one- or two-locus summaries, providing limited insight into the global architecture of MICA–MICB genotype constellations across individuals. Third, advanced exploratory tools such as clustering based on high-dimensional allele presence/absence matrices, manifold learning (principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP)), and network-based descriptions of co-occurrence, have rarely been applied to non-classical MHC loci, even though they are now standard in other areas of genomics and single-cell biology. As a result, there is still an incomplete picture of how MICA and MICB polymorphism is organized at the population level and how distinct genotype backgrounds might be recognized or targeted in clinical and research settings.

From a clinical perspective, donor selection increasingly extends beyond classical HLA matching. Non-HLA immunogenetic factors such as MICA and MICB mismatches and anti-MICA antibodies have been associated with graft rejection and graft-versus-host disease in both solid organ transplantation and HSCT. However, interpretation of these mismatches requires population-specific baseline information. Without knowledge of which MICA–MICB combinations are common, rare, or part of conserved genetic backgrounds, it is difficult to distinguish clinically meaningful incompatibilities from expected variation. Therefore, the practical goal of this study is not merely descriptive but to provide a population reference framework that allows histocompatibility laboratories to contextualize non-classical mismatches, prioritize potentially relevant combinations, and support future risk-prediction algorithms incorporating non-HLA loci.

The present study addresses this gap by analyzing a large cohort from southern Portugal, typed at high resolution for MICA and MICB using a hybrid-capture NGS workflow. The analysis combines classical population-genetic metrics (allele and genotype frequencies, heterozygosity, Hardy–Weinberg equilibrium (HWE), and LD) with unsupervised ML techniques (k-means and spectral clustering, PCA, t-SNE, UMAP, Jaccard-based similarity mapping, and network analysis) to provide an integrated view of MICA/MICB population structure. In addition to characterizing allele distributions in an Iberian population that is underrepresented in the MICA/MICB literature, the study maps recurrent MICA–MICB genotypes and identifies clusters that could inform future association studies in transplantation, infection, and cancer immunology.

2. Materials and Methods

2.1. Study Design and Anonymized Dataset

This retrospective analysis used a fully anonymized dataset of MICA and MICB genotypes obtained from routine histocompatibility testing in southern Portugal. No personal identifiers, clinical variables, dates, or other information enabling re-identification were available to the investigators, and no additional procedures or contacts with patients were involved. According to institutional policy and national regulations, studies based solely on non-identifiable data generated in routine care are exempt from formal Ethics Committee review and from the requirement for individual informed consent. The study was nevertheless conducted in line with the principles of the Declaration of Helsinki, with particular emphasis on data minimization, confidentiality, and the protection of participants’ rights and welfare.

2.2. Sample Collection and DNA Extraction

Peripheral blood was collected in EDTA tubes as part of routine histocompatibility testing. Genomic DNA was isolated in the local tissue typing laboratory using standard procedures based on either salting-out or silica-membrane column purification, following the manufacturers’ instructions. DNA quantity and purity were assessed spectrophotometrically, and samples meeting the minimal requirements for NGS library preparation were stored at −20 °C until use.

2.3. High-Resolution MICA and MICB Typing

High-resolution typing of MICA and MICB was performed as part of a capture-based NGS HLA typing workflow using the AlloSeq Tx solution (AlloSeq Tx17, CareDx, Brisbane, CA, USA). AlloSeq Tx is an NGS-based typing system that uses hybrid capture technology instead of long-range PCR, enabling full-gene coverage of classical HLA loci and several additional transplant-associated genes, including HLA-E, HLA-F, HLA-G, HLA-H, MICA, and MICB.

Library preparation followed the manufacturer’s protocol. Briefly, genomic DNA (≥50 ng per sample) was fragmented and tagged in a single-tube tagmentation step, followed by index PCR with early barcoding to minimize sample-switching risk. After bead-based normalization and pooling, libraries were enriched by hybridization with a biotinylated probe panel targeting full HLA genes and the additional non-classical loci and subsequently captured using streptavidin-coated magnetic beads. Enriched, indexed libraries were quantified, pooled, and sequenced on an Illumina NGS platform.

Allele assignment was performed using the AlloSeq Assign software (version 1.0.6.1409) and the IPD-IMGT/HLA database (version 3.57.0.0, 8 July 2024). Two high-resolution alleles at MICA and two at MICB were available for each individual. Genotypes were treated as diploid and unordered, with homozygous configurations coded as A/A and heterozygous configurations coded as A/B. Consequently, all analyses reflect diploid genotype structure, respecting the biological inheritance pattern in which one MICA allele and one MICB allele originate from each parent.

This principle was respected in all downstream calculations, including:

allele-frequency computation (2N alleles per locus);
diversity indices;
Hardy–Weinberg equilibrium tests;
multilocus heterozygosity assessment;
co-occurrence and LD estimation;
construction of combined MICA–MICB genotype configurations;
machine-learning feature encoding (presence/absence of alleles);
haplotype-level visualization of recurrent genotype configurations.

2.4. Data Curation and Coding Strategy

Raw genotype fields were harmonized prior to analysis by enforcing consistent use of asterisks and colons, trimming whitespace, and converting ambiguous or missing values to “NA”. Alleles annotated with the “Q” suffix (e.g., MICA*011:01:04Q, MICA*011:01Q) were retained as distinct specificities, following IPD-IMGT/HLA nomenclature, and were not collapsed with the corresponding non-Q alleles. The Q flag was treated as an annotation of potential sequence ambiguity but did not modify the underlying allele coding used for frequency estimates, LD calculations, or clustering and t-SNE/UMAP visualizations. For allele-level analyses, each individual contributed two allele copies per locus (2N alleles). For genotype-level analyses, the two alleles at MICA and the two alleles at MICB were retained as diploid unordered genotypes. For multilocus analyses, MICA and MICB genotypes were combined into four-allele genotype constellations.
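The harmonization rules described above could be sketched as follows. This is a minimal illustration and not the authors' script: the function name `normalize_allele`, the regular expression, and the set of tokens treated as missing are all assumptions made for the example, and whitespace is removed everywhere in the string rather than only at its ends.

```python
import re

# Hypothetical tokens treated as missing; the text only states that
# ambiguous or missing values were converted to "NA".
MISSING_TOKENS = {"", "NA", "N/A", "NULL", "-"}

# Locus, optional asterisk, colon-separated numeric fields, optional suffix (e.g. Q).
ALLELE_PATTERN = re.compile(r"^(MICA|MICB)\*?(\d+(?::\d+)*)([A-Z]?)$")


def normalize_allele(raw):
    """Return a harmonized allele string, or "NA" if the value cannot be parsed."""
    if raw is None:
        return "NA"
    text = re.sub(r"\s+", "", str(raw)).upper()  # drop all whitespace
    if text in MISSING_TOKENS:
        return "NA"
    match = ALLELE_PATTERN.match(text)
    if match is None:
        return "NA"
    locus, fields, suffix = match.groups()
    # The suffix is kept, so Q alleles stay distinct from their non-Q counterparts.
    return f"{locus}*{fields}{suffix}"
```

Under these assumptions, `normalize_allele(" mica*008:01 ")` yields `MICA*008:01`, and `MICA*011:01:04Q` is returned unchanged with its Q suffix intact.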

2.5. Population-Genetic Analyses

Allele frequencies were calculated from 2N allele copies per locus, where N is the number of individuals; each genotype therefore contributed two allele copies to the frequency estimates. Two diversity metrics, Shannon entropy (H) and Pielou evenness (H/log k, where k is the number of observed alleles), were computed for each locus. Observed heterozygosity was calculated per locus as the proportion of individuals carrying two different alleles, and multilocus heterozygosity as the number of heterozygous loci (0, 1, or 2) per individual. HWE was evaluated for the most frequent alleles at each locus using a chi-square test applied to AA, Aa and aa genotype counts, with the locus collapsed to a bi-allelic system (the allele of interest versus all other alleles).
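The per-locus metrics in this section can be sketched in a few lines of standard-library Python. The function names are hypothetical, the use of natural logarithms for H is an assumption (the text does not state the log base), and the HWE p-value uses the one-degree-of-freedom chi-square distribution that is customary for a bi-allelic test.

```python
import math
from collections import Counter


def allele_frequencies(genotypes):
    """Frequencies from 2N allele copies; genotypes is a list of (a1, a2) tuples."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())  # equals 2N
    return {allele: n / total for allele, n in counts.items()}


def shannon_entropy(freqs):
    """Shannon entropy H with natural logarithms (log base is an assumption)."""
    return -sum(p * math.log(p) for p in freqs.values() if p > 0)


def pielou_evenness(freqs):
    """H / log k, where k is the number of observed alleles."""
    k = len(freqs)
    return shannon_entropy(freqs) / math.log(k) if k > 1 else 0.0


def observed_heterozygosity(genotypes):
    """Proportion of individuals carrying two different alleles."""
    return sum(a1 != a2 for a1, a2 in genotypes) / len(genotypes)


def hwe_chi_square(genotypes, allele):
    """Chi-square HWE test with the locus collapsed to `allele` (A) versus the rest (a)."""
    n = len(genotypes)
    copies = [(a1 == allele) + (a2 == allele) for a1, a2 in genotypes]
    observed = [copies.count(2), copies.count(1), copies.count(0)]  # AA, Aa, aa
    p = sum(copies) / (2 * n)
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value
```

Collapsing the locus to "allele of interest versus all others" keeps the test at one degree of freedom regardless of how many alleles segregate, at the cost of ignoring deviations among the pooled alleles.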

2.6. Inter-Locus Association and Linkage Disequilibrium

Statistical dependence between MICA and MICB alleles was evaluated at the individual level. For selected frequent alleles, co-occurrence frequencies were calculated and visualized as a heatmap. Classical linkage disequilibrium parameters (D, D′ and r²) were derived from carrier frequencies:

p(A): carrier frequency of A
p(B): carrier frequency of B
p(AB): frequency of individuals carrying both A and B.

Allele pairs exhibiting the strongest r² values were summarized. Because multiple MICA–MICB allele pairs were evaluated, the LD statistics are reported as descriptive measures of effect size rather than as formally multiplicity-adjusted hypothesis tests. Interpretation therefore focuses on associations with large magnitude (e.g., high |D′| and r²) that are robust to multiple-testing considerations, whereas weaker signals are treated as exploratory.
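Since the formulas are not printed in the text, the following sketch shows one way the three statistics could be obtained from the carrier frequencies p(A), p(B) and p(AB). It assumes the standard textbook definitions of D, D′ and r², applied here to carrier frequencies rather than phased haplotype frequencies, which is why the results are "LD-like"; the function name is hypothetical.

```python
def ld_from_carriers(p_a, p_b, p_ab):
    """Return (D, D', r2) computed from carrier frequencies.

    D' keeps the sign of D; take abs() to obtain |D'|.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

Two alleles that are always carried together at the same frequency give D′ and r² of 1, whereas a pair whose joint carrier frequency equals the product of the individual frequencies gives values near 0.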

2.7. Combined MICA–MICB Genotype Constellations

All distinct four-allele MICA–MICB genotype constellations were enumerated. Rare constellations (carried by single individuals) were retained for frequency tabulation but only recurrent configurations (defined a priori as occurring in ≥3 individuals) were used in exploratory machine-learning visualizations and cluster maps.
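As an illustration of the enumeration step, the sketch below builds an order-independent key for each individual and counts how often each key occurs. The function names and the input layout (a pair of MICA alleles and a pair of MICB alleles per individual) are assumptions made for the example; only the threshold of three individuals comes from the text.

```python
from collections import Counter


def constellation(mica, micb):
    """Canonical key: unordered MICA genotype plus unordered MICB genotype."""
    return (tuple(sorted(mica)), tuple(sorted(micb)))


def recurrent_constellations(individuals, min_count=3):
    """Count all constellations and return those seen in at least min_count individuals."""
    counts = Counter(constellation(mica, micb) for mica, micb in individuals)
    recurrent = {key: n for key, n in counts.items() if n >= min_count}
    return counts, recurrent
```

Sorting the two alleles within each locus ensures that A/B and B/A are counted as the same genotype, which matches the unordered diploid coding used throughout.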

2.8. Machine-Learning Encoding and Clustering

For unsupervised exploration of allele-distribution structure, a binary presence/absence matrix was constructed across individuals. Each distinct allele represented a feature; the feature value was set to 1 when present and 0 otherwise, yielding a sparse binary feature space.

k-means clustering was applied to this matrix with Euclidean distance. The optimal number of clusters was selected by analysis of silhouette scores. As a complementary method, spectral clustering was applied to a similarity graph based on pairwise Jaccard indices.
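The presence/absence encoding that feeds both clustering methods can be sketched as below. The function name and input layout (one four-allele tuple per individual) are hypothetical; the resulting rows are the kind of input that a Euclidean k-means implementation such as scikit-learn's `KMeans` would accept.

```python
def presence_absence_matrix(individuals):
    """Encode each individual as a 0/1 vector over all observed alleles.

    `individuals` is a list of four-allele tuples (two MICA, two MICB).
    Returns the sorted allele list (column order) and the binary rows.
    """
    alleles = sorted({allele for person in individuals for allele in person})
    index = {allele: j for j, allele in enumerate(alleles)}
    matrix = []
    for person in individuals:
        row = [0] * len(alleles)
        for allele in person:
            row[index[allele]] = 1  # a homozygote still yields a single 1
        matrix.append(row)
    return alleles, matrix
```

One consequence of this encoding is that allele dosage is not represented: a homozygous and a heterozygous carrier of the same allele receive the same value in that column.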

2.9. Dimensionality-Reduction and Manifold Projection

To visualize multilocus structure, three complementary dimensionality-reduction methods were used, one linear (PCA) and two non-linear manifold-learning techniques (t-SNE and UMAP):

Principal Component Analysis;
t-distributed Stochastic Neighbor Embedding;
Uniform Manifold Approximation and Projection.

Cluster membership, allele counts, and heterozygosity were overlaid on the low-dimensional embedding space to aid biological interpretation.

2.10. Haplotype-Level Low-Dimensional Mapping

Combined MICA–MICB genotype constellations occurring in ≥3 individuals were encoded and embedded in two dimensions using t-SNE. k-means clustering was then performed in the t-SNE space. Voronoi-style decision regions were plotted to delineate haplotypic “territories”, facilitating interpretation of recurrent genotype backgrounds.

2.11. Similarity Metrics and Hierarchical Clustering

Pairwise Jaccard similarity coefficients were computed between individuals based on allele presence/absence. The resulting similarity matrix was displayed as a heatmap. A dissimilarity matrix (1 − Jaccard) was used for agglomerative hierarchical clustering (average linkage), and the corresponding dendrogram was examined for grouping patterns.
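A minimal sketch of the similarity computation is given here, using allele sets per individual. The function names are hypothetical, and the returned square matrix of 1 − Jaccard values is only one possible representation; a hierarchical clustering routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` would typically expect it in condensed form.

```python
def jaccard_similarity(alleles_a, alleles_b):
    """Shared alleles divided by the union of alleles carried by two individuals."""
    set_a, set_b = set(alleles_a), set(alleles_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def jaccard_distance_matrix(individuals):
    """Symmetric matrix of 1 - Jaccard dissimilarities between all individuals."""
    n = len(individuals)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - jaccard_similarity(individuals[i], individuals[j])
            dist[i][j] = dist[j][i] = d
    return dist
```

For example, two individuals who carry three and four distinct alleles respectively and share two of them have a Jaccard similarity of 2/5 and a dissimilarity of 3/5.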

2.12. Software Environment and Reproducibility

All analyses were performed in Python (version 3.13), executed in a reproducible scripted environment. The following major libraries were used:

pandas (data structuring and manipulation)
NumPy (vectorized numeric operations)
SciPy (statistical functions)
scikit-learn (k-means, spectral clustering, PCA, t-SNE)
UMAP-learn (UMAP projections)
matplotlib/seaborn (visualization)

Scripts were version-controlled and executed without manual intervention to guarantee reproducibility. Intermediate and final outputs consisted of machine-readable CSV tables and high-resolution figures.

3. Results and Discussion

The final dataset comprised 1364 unrelated individuals from southern Portugal typed at high resolution for MICA and MICB using a capture-based NGS workflow. Forty-two distinct MICA alleles and twenty-two MICB alleles were identified (Table 1). At the MICA locus, the most frequent alleles were MICA*002:01 and MICA*008:01, each accounting for approximately 16.9% of all observed MICA allele copies in the cohort, followed by MICA*004:01 (14.3%), MICA*008:04 (10.0%) and MICA*009:01 (8.3%). At MICB, MICB*005:02 predominated and represented 41.2% of all MICB allele copies, with MICB*002:01 (20.9%), MICB*004:01 (15.2%), MICB*008:01 (9.2%) and MICB*005:03 (3.8%) forming a second tier of common variants. Figure 1A,B illustrates the dominance of a few alleles at each locus.

These spectra are broadly concordant with reports from other European cohorts typed by NGS, where alleles of the MICA*002, MICA*004, MICA*008 and MICB*005, MICB*004, MICB*002 groups also dominate, although the exact rank order and contribution of low-frequency alleles vary across regions. In a German donor registry of more than two million individuals, MICA*008, MICA*002 and MICA*009, and MICB*005, MICB*004 and MICB*002 were the most frequent alleles, mirroring the pattern observed here but with substantially higher frequencies for MICA*008 and MICB*005 in that population [9]. A recent Finnish analysis likewise confirmed extensive MICA and MICB polymorphism with notable regional frequency differences within Europe [36]. The present data extend this picture by providing a detailed allele-frequency reference for a southern Portuguese population, a region that has been underrepresented in non-classical MHC class I population studies. This reference may be directly useful for histocompatibility laboratories in the Iberian area, for which only small studies have been available so far [30].

Diversity metrics reinforced the impression that MICA is more polymorphic than MICB in this cohort (presented in Table 2). Shannon entropy was higher for MICA (H ≈ 2.7) than for MICB (H ≈ 1.8), and the expected heterozygosity (He) likewise indicated a broader and more even distribution of MICA alleles. This pattern is consistent with prior observations that MICA exhibits particularly rich allelic diversity in Europe and South America [8,9], whereas MICB, although clearly polymorphic, tends to show a more concentrated spectrum dominated by a few common alleles. From an evolutionary perspective, such a contrast is compatible with a combination of balancing selection on ligand–receptor interactions, hitch-hiking with classical HLA haplotypes, and demographic processes shaping the non-classical MHC region. Recent large-scale imputation work has also highlighted that MICA, MICB, HLA-E, HLA-F and HLA-G carry substantial information content beyond classical HLA loci and can now be inferred accurately in biobank-scale datasets, further underscoring their relevance for population and disease studies [37].

Observed heterozygosity was high at both loci, but again higher for MICA (≈0.90) than for MICB (≈0.74) (Table 2). At the individual level, multilocus heterozygosity exemplified the richness of non-classical MHC variation in this population: 932 individuals (68.3%) were heterozygous at both loci, 371 (27.2%) were heterozygous at one locus only, and only 61 (4.5%) were homozygous at both MICA and MICB (Figure 2). The same information is summarized numerically in Table 3. Thus, most individuals express two distinct MICA allotypes and two distinct MICB allotypes, creating considerable diversity in the repertoire of NKG2D ligands at the cellular surface. This multilocus diversity is biologically plausible for stress-inducible ligands involved in immune surveillance and may be particularly relevant in settings where NK cells and cytotoxic T cells play central roles, such as viral infection, tumor immunology and transplantation. In allogeneic HSCT, where NK cells and CD8⁺ T cells reconstitute early and NKG2D engagement can influence both graft-versus-leukemia and graft-versus-host disease (GVHD), the presence of multiple MICA and MICB allotypes per individual is of special interest because it broadens the possible ligand landscape experienced by donor-derived effector cells [21,25,26,38].

HWE analyses for the most frequent alleles did not reveal strong deviations from equilibrium expectations. For example, the genotype distributions of MICA*002:01 and MICB*005:02 were compatible with HWE under a panmictic population model, arguing against gross genotyping artifacts or very strong selection acting on single alleles in this cohort (Table 4). The absence of marked HWE departures at the allele level does not exclude more subtle selection on extended haplotypes, cis-regulatory variants or amino-acid motifs, but it provides reassurance regarding the technical robustness and population coherence of the dataset. This is important because artifacts in NGS-based typing of non-classical loci remain a concern in some settings; recent work has emphasized the need for careful assay validation and curated reference panels for MICA and MICB.

The relationship between MICA and MICB polymorphism was further examined through allele co-occurrence patterns and LD-like statistics. A heatmap summarizing the co-occurrence of the fifteen most frequent alleles at each locus (Figure 3 and Supplementary Table S1) revealed several non-random combinations at the individual level.

LD metrics showed strong associations for common MICA–MICB pairs, including MICA*009:02–MICB*005:06 (D′ ≈ 0.94, r² ≈ 0.69), MICA*016:01–MICB*005:01 (r² ≈ 0.42), MICA*008:04–MICB*004:01 (r² ≈ 0.16) and MICA*008:01–MICB*008:01 (r² ≈ 0.12) (Figure 4). The highest r² values indicate conserved haplotypic backgrounds in which specific MICA and MICB alleles are inherited together much more frequently than expected under random assortment. Similar non-random associations between MICA, MICB and HLA-B have been documented in Brazilian, Bulgarian and Finnish cohorts and are thought to reflect the shared evolutionary history of this segment of chromosome 6 [8,10,36]. From a practical perspective, such conserved haplotypes are useful for imputation and can inform interpretation of non-classical mismatches in donors typed only at classical HLA loci. Such strong MICA–MICB associations are also consistent with the broader architecture of conserved extended haplotypes in the MHC, in which classical and non-classical loci can be co-inherited across megabase-length segments [11].

At the genotype level, 606 distinct four-allele MICA–MICB genotype constellations (unordered MICA genotype plus unordered MICB genotype) were observed among the 1364 individuals. As expected for highly polymorphic loci, the distribution was long-tailed: many constellations occurred only once or twice, whereas a limited subset of recurrent backgrounds accounted for a substantial fraction of all genotypes. The two most frequent constellations were MICA*004:01/MICA*008:04 with MICB*004:01/MICB*005:02 and MICA*004:01/MICA*008:01 with MICB*005:02/MICB*005:02, each present in seventeen individuals. Several other constellations combining MICA*002:01, MICA*004:01 or MICA*008:01 with MICB*002:01 and/or MICB*005:02 were also prominent. These recurrent genotype backbones represent the dominant NKG2D-ligand contexts expected in donors and recipients from this region and form a natural reference for future association studies in solid organ transplantation, HSCT and infection.

To look beyond single-locus frequency summaries, individual genotypes were encoded as a binary presence/absence matrix across all observed MICA and MICB alleles, and unsupervised machine-learning methods were applied to this high-dimensional space. k-means clustering was performed with k between 2 and 7, and the average silhouette coefficient, a standard measure of cluster separation, suggested k = 3 as a reasonable compromise between separation and interpretability (average silhouette ≈ 0.14; Figure 5). Numeric silhouette scores for k values from 2 to 7 are provided in Supplementary Table S5. The silhouette value is modest, but this is typical for complex genetic datasets with overlapping groups, and it indicates that the data are better described as a continuous structure with areas of higher density than as sharply separated subpopulations.
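For readers unfamiliar with the silhouette coefficient, the following sketch computes the average value for a given labelling with Euclidean distance. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b − a)/max(a, b). The function name is hypothetical and this is a didactic implementation, not the routine used in the study; scoring singleton clusters as 0 is the usual convention.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette coefficient with Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the listed members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = mean_dist(i, own)
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points sit closer to a neighbouring cluster than to their own.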

The final k-means partition produced three clusters of comparable size (430, 498 and 436 individuals). Each cluster displayed a characteristic, although overlapping, allelic signature, with several common alleles (for example MICA*008:01 and MICB*005:02) contributing substantially to more than one cluster (Table 5). One cluster was enriched for MICA*008:01, MICA*004:01 and MICA*002:01 together with MICB*005:02; a second was characterized by MICA*002:01 and MICA*009:01 in combination with a high frequency of MICB*002:01; and a third cluster showed increased frequencies of MICA*008:01, MICA*008:04, MICA*004:01 and MICA*009:02, accompanied by an MICB profile enriched in MICB*004:01, MICB*008:01, MICB*003:01, MICB*005:06 and MICB*005:03. These clusters likely represent overlapping genetic backgrounds rather than discrete subgroups, consistent with the complex LD structure of the MHC.

Dimensionality-reduction techniques provided complementary views of this structure. PCA positioned individuals in a two-dimensional space where the three k-means clusters occupied partially distinct, but overlapping, regions along the first two principal components (Figure 6). Non-linear methods, namely t-SNE and UMAP, emphasized local neighborhood relationships and revealed more clearly delineated “clouds” corresponding to each cluster, with the cluster enriched for genotypes carrying MICA*009:02 and MICB*005:06 tending to occupy a more peripheral position in the embedding space (Figure 7 and Figure 8). These manifold-learning maps visualize the MICA/MICB genotype landscape, revealing regions of higher density and gradations between common haplotypic backgrounds. Similar visual strategies have been applied successfully to other high-dimensional immunogenetic datasets, including single-cell transcriptomic and T-cell receptor repertoires, to reveal structure that is not obvious in raw categorical data.

As a sensitivity analysis, spectral clustering based on a nearest-neighbor Jaccard similarity graph was also applied. This method identified one large group of 1344 individuals and two very small clusters of ten individuals each. Exact cluster sizes for both k-means and spectral clustering are summarized in Supplementary Table S6. The very small cluster sizes suggest that these sets most likely correspond to rare or unusual haplotypic configurations at the edges of the distribution rather than to biologically distinct subpopulations. Given their exploratory nature and limited interpretability, the detailed spectral clustering results are provided as Supplementary Material, while the k-means-based partition and the associated PCA/t-SNE/UMAP maps remain the focus of the main text.

To examine similarity among individuals in more detail, pairwise Jaccard similarity was computed from the allele presence/absence matrix. The resulting similarity heatmap and associated hierarchical clustering (Figure 9, Figure 10 and Figure 11) revealed blocks of individuals sharing highly similar MICA/MICB repertoires, broadly corresponding to the k-means clusters but also containing smaller subgroups driven by particular alleles or allele combinations.
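The pairwise Jaccard similarity between two individuals is the number of alleles they share divided by the number of distinct alleles carried by either. A minimal sketch follows, with hypothetical function names and invented toy genotypes.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two allele collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pairwise_jaccard(genotypes):
    """Symmetric individual-by-individual similarity matrix."""
    n = len(genotypes)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = jaccard(genotypes[i], genotypes[j])
    return sim

# Invented toy genotypes: (MICA allele 1, MICA allele 2, MICB allele 1, MICB allele 2)
toy = [
    ("MICA*008:01", "MICA*004:01", "MICB*005:02", "MICB*005:02"),
    ("MICA*008:01", "MICA*002:01", "MICB*005:02", "MICB*002:01"),
    ("MICA*009:02", "MICA*008:04", "MICB*005:06", "MICB*004:01"),
]
for row in pairwise_jaccard(toy):
    print([round(v, 2) for v in row])
```

The corresponding distance, 1 − similarity, is the quantity that a hierarchical clustering routine would take as input to produce a dendrogram and an ordered heatmap.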

No isolated clusters of individuals were apparent, again arguing against strong hidden substructure within the cohort. In parallel, an allele–allele Jaccard similarity heatmap (Figure 12) grouped MICA and MICB alleles that tend to co-occur more frequently than expected by chance. The corresponding edge list for the allele co-occurrence network, which is particularly useful for network-based visualization in software such as Cytoscape (version 3.10.3), is provided as Supplementary Table S3.
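An allele co-occurrence edge list of this kind can be derived by comparing the sets of individuals that carry each allele. The sketch below is an assumption about one reasonable construction, not a description of how Supplementary Table S3 was generated; the function names, the `min_jaccard` threshold and the CSV column names are illustrative.

```python
import csv
import io
from itertools import combinations

def allele_edge_list(genotypes, min_jaccard=0.0):
    """Return (allele_a, allele_b, n_shared_carriers, jaccard) for co-occurring alleles."""
    carriers = {}
    for i, genotype in enumerate(genotypes):
        for allele in set(genotype):
            carriers.setdefault(allele, set()).add(i)
    edges = []
    for a, b in combinations(sorted(carriers), 2):
        shared = len(carriers[a] & carriers[b])
        if shared == 0:
            continue  # alleles never seen in the same individual: no edge
        j = shared / len(carriers[a] | carriers[b])
        if j >= min_jaccard:
            edges.append((a, b, shared, round(j, 4)))
    return edges

def edges_to_csv(edges):
    """Serialize the edge list as CSV text with one header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["source", "target", "n_shared", "jaccard"])
    writer.writerows(edges)
    return buf.getvalue()

# Invented toy genotypes
toy = [
    ("MICA*009:02", "MICA*008:01", "MICB*005:06", "MICB*005:02"),
    ("MICA*009:02", "MICA*002:01", "MICB*005:06", "MICB*002:01"),
    ("MICA*008:01", "MICA*004:01", "MICB*005:02", "MICB*005:02"),
]
print(edges_to_csv(allele_edge_list(toy, min_jaccard=0.5)))
```

The resulting table has one row per allele pair and can be imported into network software by mapping the first two columns to source and target nodes and the Jaccard value to an edge weight.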

From a clinical perspective, clustering and similarity metrics are not intended to define biological subpopulations but to quantify immunogenetic proximity between individuals. In transplantation, compatibility is rarely binary; rather, recipients may differ from donors by common or unusual genetic backgrounds. Clustering and Jaccard similarity provide a framework to measure whether a donor–recipient pair shares a typical regional MICA–MICB background or represents an uncommon combination potentially associated with increased alloreactivity [33]. These approaches may therefore support future allocation algorithms or risk-stratified analyses rather than immediate clinical decision rules.

Recurrent four-allele MICA–MICB genotype constellations observed in at least three individuals were further examined at the haplotype-constellation level using t-SNE (Figure 13). In this analysis, each point corresponded to a distinct MICA/MICB genotype constellation, and distances in the two-dimensional map reflected similarity in the underlying allele sets. The central portion of the map was populated by frequent constellations built around MICA*002:01, MICA*004:01, MICA*008:01, MICB*005:02 and MICB*002:01, whereas more peripheral regions were occupied by constellations containing rarer alleles such as MICA*009:02 and MICB*005:06. K-means clustering in the t-SNE plane identified several haplotype-constellation clusters (Supplementary Table S4), and decision-region plots helped visualize how these clusters tile the genotype space. Conceptually, this representation offers a compact map of the non-classical MHC genotype landscape: donor–recipient pairs lying within the same region of the map are expected to share closely related MICA/MICB backgrounds, whereas pairs located in different regions are more likely to differ qualitatively in NKG2D-ligand context.
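The enumeration step that precedes this embedding, namely identifying four-allele constellations observed in at least three individuals, can be sketched as follows. The function names and toy data are hypothetical; the threshold of three mirrors the criterion stated above.

```python
from collections import Counter

def constellation(mica_pair, micb_pair):
    """Order-independent key for a four-allele MICA–MICB genotype."""
    return tuple(sorted(mica_pair)) + tuple(sorted(micb_pair))

def recurrent_constellations(individuals, min_count=3):
    """Return [(constellation, count)] for constellations seen in >= min_count individuals."""
    counts = Counter(constellation(mica, micb) for mica, micb in individuals)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]

# Invented toy cohort: each entry is ((MICA, MICA), (MICB, MICB))
toy = [
    (("MICA*008:01", "MICA*002:01"), ("MICB*005:02", "MICB*002:01")),
    (("MICA*002:01", "MICA*008:01"), ("MICB*002:01", "MICB*005:02")),  # same, other order
    (("MICA*008:01", "MICA*002:01"), ("MICB*005:02", "MICB*002:01")),
    (("MICA*009:02", "MICA*008:04"), ("MICB*005:06", "MICB*004:01")),
]
for key, n in recurrent_constellations(toy):
    print(n, key)
```

Sorting the alleles within each locus ensures that the same unphased genotype is counted once regardless of the order in which the two alleles were reported.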

A complementary view of genotype organization was obtained by tabulating and visualizing the most common MICA and MICB genotypes in a two-dimensional matrix (Figure 14). This matrix highlighted a limited set of highly recurrent genotype combinations, mostly involving MICA*002:01, MICA*004:01, MICA*008:01 and MICB*002:01, MICB*004:01, MICB*005:02, which accounted for a large proportion of the cohort. Rare genotype combinations tended to appear in sparsely populated rows and columns, echoing their peripheral position in the t-SNE map. This dual representation (clustered matrix and manifold embedding) makes it easier to identify genotype backgrounds that may deserve specific attention in future association analyses.

These findings have implications for both solid organ transplantation and allogeneic HSCT. In kidney transplantation, anti-MICA antibodies, often targeting epitopes present on common alleles such as MICA*008, have been associated in several cohorts with acute rejection, chronic allograft dysfunction and inferior graft survival, even after adjustment for classical HLA donor-specific antibodies, although not all studies have been consistent. The landmark New England Journal of Medicine study by Zou and colleagues reported lower one-year graft survival and higher rejection rates in recipients with anti-MICA antibodies. More recent work continues to support a role for anti-MICA, and potentially anti-MICB, as non-HLA donor-specific antibodies contributing to allograft injury, particularly when combined with HLA donor-specific antibodies. Knowledge of which MICA and MICB alleles and combined genotypes dominate in a given population is crucial for interpreting non-HLA antibody profiles, designing bead panels that adequately cover local polymorphism, and quantifying the incremental predictive value of anti-MICA/MICB antibodies in graft-outcome models [15].

In HSCT and hematopoietic progenitor transplantation, MICA and MICB have emerged as non-classical histocompatibility determinants acting through the NKG2D axis. Mismatches at MICA, particularly involving the functional MICA-129 Met/Val dimorphism that modulates NKG2D signaling strength, have been associated with acute and chronic GVHD, relapse and survival in HLA-matched unrelated donor HSCT. Matching for MICA-129 has been reported to improve outcomes in otherwise 10/10 HLA-matched transplants, and recent data from large European cohorts and meta-analyses confirm that MICA-129 mismatches can act as independent risk factors for inferior disease-free survival. MICB mismatches have also begun to receive attention, with emerging evidence that, in 9/10 HLA-matched settings, MICB mismatches may worsen disease-free and GVHD- and relapse-free survival. The present population map does not incorporate outcome data, but it provides a detailed haplotypic and multilocus framework against which future HSCT studies in this region can be interpreted. In particular, it identifies the dominant MICA–MICB genotype constellations, quantifies the frequency of alleles known or suspected to be functionally important (including those encoding MICA-129 variants), and shows how these constellations cluster in genotype space. This information can support study designs that examine whether specific non-classical haplotypic backgrounds or cluster membership are associated with GVHD, relapse, infection or transplant-related mortality [19,20,22].

Combining classical population genetics with unsupervised machine learning reveals structure often missed by standard descriptive statistics. The clustering, similarity mapping and t-SNE/UMAP visualizations do not replace hypothesis-driven association tests, but they provide a global view of the MICA/MICB genotype landscape, highlight features such as conserved haplotypes and dominant backgrounds, and suggest biologically plausible partitions that may be useful for stratification in future clinical or functional analyses. Comparable computational pipelines are increasingly used to integrate HLA data, antibody profiles and clinical covariates into ML-based risk prediction models for solid organ and HSCT outcomes, and the present work shows how these concepts can be extended to non-classical MHC class I ligands.

Several limitations should be acknowledged. First, the cohort originates from a single histocompatibility laboratory in southern Portugal and, although numerically robust, may not fully capture the diversity of the broader Portuguese or Iberian populations. Explicit ancestry information and replication in additional centers would strengthen generalizability. Second, the dataset was intentionally restricted to fully anonymized genotypes without demographic, clinical or outcome data, so the analyses remain descriptive and cannot address associations with rejection, GVHD, relapse, infection or survival. Third, unsupervised algorithms are sensitive to parameter choices and to the curse of dimensionality; the modest silhouette scores and reliance on two-dimensional embeddings for visualization mean that clusters and maps should be interpreted as heuristic summaries rather than definitive subpopulation boundaries. Fourth, although the typing platform also generates classical HLA data, phase-resolved extended MHC haplotypes could not be reconstructed because the dataset was anonymized and lacked family or segregation information. Therefore, the present analysis focuses on allele co-occurrence rather than formally defined ancestral haplotypes. In future multi-center datasets with appropriate phasing strategies (e.g., family-based data or statistical phasing), extended MHC haplotypes integrating classical HLA loci and non-classical genes from the same platform could be reconstructed, enabling direct testing of clinical associations. In addition, only MICA and MICB were interrogated in depth, even though the underlying NGS platform also provides high-resolution data for HLA-E, HLA-F, HLA-G and other loci that participate in NK-cell and T-cell regulation, and emerging data suggest that structural variants such as MICA deletions may introduce additional layers of complexity.

From a real-world perspective, the present dataset is best used to complement, rather than replace, classical HLA matching. Its most immediate value is to support targeted, case-based use of MICA/MICB typing and non-HLA antibody interpretation and to guide the development and validation of more detailed histocompatibility assays. In practice, three workflows are particularly realistic:

(i) Non-HLA antibody workup in solid organ transplantation: when anti-MICA and/or anti-MICB reactivity is detected (or suspected in the setting of antibody-mediated injury not fully explained by HLA-DSA), laboratories can use local allele and genotype frequency information to interpret whether the implicated specificities represent common regional allotypes versus rare variants and to prioritize confirmatory testing and panel coverage accordingly. When donor typing is available, this enables a structured “non-HLA virtual crossmatch” by determining whether the donor expresses the implicated allotype/epitope and by documenting the mismatch context for multidisciplinary risk discussions and post-transplant monitoring strategies;

(ii) Selected HSCT donor evaluation when donors are otherwise comparable by classical HLA: in centers that already consider non-classical determinants, the recurrent multilocus constellations and similarity structure provide a practical way to judge whether a candidate donor shares a typical regional MICA/MICB background with the recipient or represents a less common constellation that may warrant closer evaluation in outcome-linked studies;

(iii) Diagnostic development and quality assurance: for histocompatibility laboratories and diagnostic manufacturers, the restricted set of common alleles and recurrent genotype backbones offers an evidence-based basis to assess population coverage of bead content, design validation panels, and prioritize which specificities are essential to include for Iberian recipients. Because commercial diagnostics serve international populations, we emphasize that these frequencies should be interpreted as a southern Portuguese/Iberian reference and should be complemented by analogous datasets from other populations for worldwide assay design; however, the analytical framework presented here is directly transferable to such multi-population efforts.
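The first workflow, a structured "non-HLA virtual crossmatch", can be sketched as a simple lookup that checks whether the donor carries an allele matching each reported antibody specificity and annotates the hit with its local frequency. This is an illustrative sketch under stated assumptions, not a validated clinical rule: the function names, the antibody specificities and the donor typing are hypothetical, matching is done at the allele-name level rather than the epitope level, and the only real values are the four frequencies taken from Table 1.

```python
# Allele frequencies (%) taken from Table 1; all other values below are hypothetical.
LOCAL_FREQ_PERCENT = {
    "MICA*002:01": 15.8724,
    "MICA*008:01": 15.8724,
    "MICB*005:02": 40.2126,
    "MICB*002:01": 20.4179,
}

def matches(specificity, allele):
    """True if the allele equals the specificity or belongs to its allele group."""
    return allele == specificity or allele.startswith(specificity + ":")

def non_hla_virtual_crossmatch(specificities, donor_alleles, freq=LOCAL_FREQ_PERCENT):
    """One report row per antibody specificity, flagging donor-specific reactivity."""
    rows = []
    for spec in specificities:
        hits = sorted({a for a in donor_alleles if matches(spec, a)})
        rows.append({
            "specificity": spec,
            "donor_alleles": hits,
            "donor_specific": bool(hits),
            # None marks alleles absent from the frequency table supplied
            "local_freq_percent": {a: freq.get(a) for a in hits},
        })
    return rows

# Hypothetical recipient antibody profile and donor typing
recipient_specificities = ["MICA*008", "MICA*019"]
donor_typing = ["MICA*008:01", "MICA*002:01", "MICB*005:02", "MICB*005:02"]
for row in non_hla_virtual_crossmatch(recipient_specificities, donor_typing):
    print(row)
```

A report of this form documents the mismatch context for discussion; any interpretation of risk would still depend on antibody strength, epitope sharing and the clinical setting, none of which are modeled here.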

Despite these limitations, our results provide a detailed view of MICA and MICB diversity in a Southern European population. The combination of allele-frequency analysis, diversity metrics, LD mapping, multilocus genotype enumeration, individual-level similarity structure and haplotype-level manifold representations provides a detailed description of the non-classical MHC class I ligand landscape. This framework can assist histocompatibility laboratories in interpreting non-HLA antibodies and in identifying clinical scenarios where MICA/MICB typing is most actionable (e.g., anti-MICA/anti-MICB sensitization or selected HSCT settings), and it can support future integrated risk models that incorporate non-classical MHC determinants alongside classical HLA and clinical covariates.

4. Conclusions

High-resolution typing of MICA and MICB in 1364 unrelated individuals from southern Portugal provides a detailed reference for non-classical MHC class I variation in this region. A restricted set of common alleles, particularly MICA*002:01, MICA*004:01, MICA*008:01, MICA*008:04 and MICB*002:01, MICB*004:01, MICB*005:02, MICB*008:01, accounts for a large proportion of the observed diversity, overlaid on a long tail of less frequent variants. Diversity indices and heterozygosity estimates indicate that MICA is more polymorphic and has a more even allele-frequency distribution than MICB, but both loci contribute substantially to individual-level variability, with most individuals carrying at least two distinct MICA and two distinct MICB allotypes.

Joint analysis of the two loci shows that this variation is not randomly organized. Strong LD-like associations identify conserved MICA–MICB allele pairs, and multilocus enumeration highlights a limited number of recurrent four-allele constellations that dominate the genotype landscape. Unsupervised clustering and manifold-learning projections reveal continuous structures organized around a few dense genetic backgrounds rather than sharply separated subpopulations. These results outline a compact but informative map of the MICA/MICB genotype space in a Southern European cohort.

This map has direct implications for transplantation and immunogenetics. In solid organ transplantation, knowledge of locally prevalent MICA and MICB alleles and genotype backbones can inform the design and interpretation of non-HLA antibody testing and support studies assessing the added prognostic value of anti-MICA and anti-MICB antibodies. In hematopoietic stem cell and progenitor transplantation, increasing evidence links MICA and MICB mismatches, including functionally relevant variants such as the MICA-129 dimorphism, to GVHD, relapse and survival. The present data identify the principal non-classical MHC backgrounds and their relative frequencies, providing a practical framework for future outcome studies that explicitly incorporate MICA and MICB into adjunct donor evaluation and risk-stratification models.

Several limitations qualify the interpretation. The cohort originates from a single center and lacks explicit ancestry, phenotypic and outcome information; the analyses are therefore descriptive and cannot address causality or clinical effect sizes. Unsupervised methods and low-dimensional embeddings are sensitive to parameter choices and should be regarded as heuristic summaries rather than strict subpopulation definitions. In addition, the present work focuses on allele-level genotypes and does not incorporate expression-level variation, structural variants or epitope-based mismatches.

Despite these limitations, the integrated population-genetic and AI-assisted analyses presented here deliver a coherent overview of MICA and MICB diversity in southern Portugal and provide a basis for future clinical and computational studies in which non-classical MHC class I ligands are treated as integral components of the histocompatibility landscape.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/j9010008/s1, Table S1: Co-occurrence matrix for common MICA and MICB alleles; Table S2: LD-like measures for selected common MICA–MICB allele pairs; Table S3: Edge list for the allele co-occurrence network; Table S4: Cluster membership of recurrent MICA–MICB genotype constellations in the t-SNE map; Table S5: Silhouette coefficients for k-means clustering of MICA/MICB genotypes; Table S6: Cluster sizes for k-means and spectral clustering.

Author Contributions

L.R., P.A. and R.A. conceptualized the draft. L.R., P.A. and R.A. contributed equally to the writing and reviewing of the original draft through interpretation of the literature. L.R., P.A., R.A. and E.E. reviewed and edited the original draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical approval is not required as per local legislation [studies based solely on non-identifiable data generated in routine care are exempt from formal Ethics Committee review and from the requirement for individual informed consent].

Informed Consent Statement

Informed consent for participation is not required as per local legislation [studies based solely on non-identifiable data generated in routine care are exempt from formal Ethics Committee review and from the requirement for individual informed consent].

Data Availability Statement

Data are contained within the article or Supplementary Material.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MHC	Major histocompatibility complex
MICA	Major histocompatibility complex (MHC) class I chain-related protein A
MICB	Major histocompatibility complex (MHC) class I chain-related protein B
LD	Linkage disequilibrium
NGS	Next-generation sequencing
HSCT	Hematopoietic stem cell transplantation
GVHD	Graft-versus-host disease
AI	Artificial intelligence
ML	Machine learning
PCA	Principal component analysis
t-SNE	t-distributed stochastic neighbor embedding
UMAP	Uniform manifold approximation and projection
HWE	Hardy–Weinberg equilibrium

References

1. González, S.; López-Soto, A.; Suarez-Alvarez, B.; López-Vázquez, A.; López-Larrea, C. NKG2D ligands: Key targets of the immune response. Trends Immunol. 2008, 29, 397–403.
2. López-Larrea, C.; Suárez-Alvarez, B.; López-Soto, A.; López-Vázquez, A.; Gonzalez, S. The NKG2D receptor: Sensing stressed cells. Trends Mol. Med. 2008, 14, 179–189.
3. Raulet, D.H. Roles of the NKG2D immunoreceptor and its ligands. Nat. Rev. Immunol. 2003, 3, 781–790.
4. Xing, S.; Ferrari de Andrade, L. NKG2D and MICA/B shedding: A “tag game” between NK cells and malignant cells. Clin. Transl. Immunol. 2020, 9, e1230.
5. Zhang, J.; Basher, F.; Wu, J.D. NKG2D ligands in tumor immunity: Two sides of a coin. Front. Immunol. 2015, 6, 97.
6. Ghadially, H.; Brown, L.; Lloyd, C.; Lewis, L.; Lewis, A.; Dillon, J.; Sainson, R.; Jovanovic, J.; Tigue, N.J.; Bannister, D.; et al. MHC class I chain-related protein A and B (MICA and MICB) are predominantly expressed intracellularly in tumour and normal tissue. Br. J. Cancer 2017, 116, 1208–1217.
7. Lakes, N.; Canaday, L.M.; Waggoner, S.N. Can’t drop the MIC(A/B): Preventing stress-ligand shedding to enhance pan-cancer targeting. Med 2023, 4, 398–400.
8. Yamakawa, R.H.; Saito, P.K.; Gelmini, G.F.; da Silva, J.S.; Bicalho, M.d.G.; Borelli, S.D. MICA diversity and linkage disequilibrium with HLA-B alleles in renal-transplant candidates in southern Brazil. PLoS ONE 2017, 12, e0176072.
9. Klussmeier, A.; Massalski, C.; Putke, K.; Schäfer, G.; Sauter, J.; Schefzyk, D.; Pruschke, J.; Hofmann, J.; Fürst, D.; Carapito, R.; et al. High-Throughput MICA/B Genotyping of Over Two Million Samples: Workflow and Allele Frequencies. Front. Immunol. 2020, 11, 314.
10. Al Hadra, B.; Lukanov, T.I.; Ivanova, M.I. HLA class I chain-related MICA and MICB genes polymorphism in healthy individuals from the Bulgarian population. Hum. Immunol. 2022, 83, 551–555.
11. Alper, C.A. The Path to Conserved Extended Haplotypes: Megabase-Length Haplotypes at High Population Frequency. Front. Genet. 2021, 12, 716603.
12. Gao, X.; Single, R.M.; Karacki, P.; Marti, D.; O’Brien, S.J.; Carrington, M. Diversity of MICA and Linkage Disequilibrium with HLA-B in Two North American Populations. Hum. Immunol. 2006, 67, 152–158.
13. Bolognesi, E.; Dalfonso, S.; Rolando, V.; Fasano, M.E.; Praticò, L.; Momigliano-Richiardi, P. MICA and MICB microsatellite alleles in HLA extended haplotypes. Eur. J. Immunogenet. 2001, 28, 523–530.
14. Wang, W.Y.; Tian, W.; Zhu, F.M.; Liu, X.X.; Li, L.X.; Wang, F. MICA, MICB Polymorphisms and Linkage Disequilibrium with HLA-B in a Chinese Mongolian Population. Scand. J. Immunol. 2016, 83, 456–462.
15. Zou, Y.; Stastny, P.; Süsal, C.; Döhler, B.; Opelz, G. Antibodies against MICA Antigens and Kidney-Transplant Rejection. N. Engl. J. Med. 2007, 357, 1293–1300.
16. Yonishi, H.; Namba-Hamano, T.; Nakazawa, S.; Yamanaka, K.; Kakuta, Y.; Hashiguchi, H.; Kawano, Y.; Kubota, T.; Tokuchi, M.; Okushima, H.; et al. Acute Antibody-Mediated Rejection Associated with Anti-MICA Antibodies in Long-Term Kidney Transplant: A Case Study. Nephron 2025, 149, 29–34.
17. Ming, Y.; Hu, J.; Luo, Q.; Ding, X.; Luo, W.; Zhuang, Q.; Zou, Y. Acute Antibody-Mediated Rejection in Presence of MICA-DSA and Successful Renal Re-Transplant with Negative-MICA Virtual Crossmatch. PLoS ONE 2015, 10, e0127861.
18. Carapito, R.; Aouadi, I.; Verniquet, M.; Untrau, M.; Pichot, A.; Beaudrey, T.; Bassand, X.; Meyer, S.; Faucher, L.; Posson, J.; et al. The MHC class I MICA gene is a histocompatibility antigen in kidney transplantation. Nat. Med. 2022, 28, 989–998.
19. Isernhagen, A.; Malzahn, D.; Viktorova, E.; Elsner, L.; Monecke, S.; von Bonin, F.; Kilisch, M.; Wermuth, J.M.; Walther, N.; Balavarca, Y.; et al. The MICA-129 dimorphism affects NKG2D signaling and outcome of hematopoietic stem cell transplantation. EMBO Mol. Med. 2015, 7, 1480–1502.
20. Fuerst, D.; Neuchel, C.; Niederwieser, D.; Bunjes, D.; Gramatzki, M.; Wagner, E.; Wulf, G.; Glass, B.; Pfreundschuh, M.; Einsele, H.; et al. Matching for the MICA-129 polymorphism is beneficial in unrelated hematopoietic stem cell transplantation. Blood 2016, 128, 3169–3176.
21. Carapito, R.; Jung, N.; Kwemou, M.; Untrau, M.; Michel, S.; Pichot, A.; Giacometti, G.; Macquin, C.; Ilias, W.; Morlon, A.; et al. Matching for the nonconventional MHC-I MICA gene significantly reduces the incidence of acute and chronic GVHD. Blood 2016, 128, 1979–1986.
22. Nihtilä, J.; Tammi, S.; Salmenniemi, U.; Itälä-Remes, M.; Crossland, R.E.; Gallardo, D.; Bieniaszewska, M.; Giebel, S.; Bogunia-Kubik, K.; Hyvärinen, K.; et al. Impact of MICA-129 Mismatch on Hematopoietic Stem Cell Transplantation Outcomes: Evidence from a Large European Cohort and Meta-Analysis. Transplant. Cell. Ther. 2025, 31, 954.e1–954.e4.
23. Gam, R.; Shah, P.; Crossland, R.E.; Norden, J.; Dickinson, A.M.; Dressel, R. Genetic association of hematopoietic stem cell transplantation outcome beyond histocompatibility genes. Front. Immunol. 2017, 8, 380.
24. Machuldova, A.; Houdova, L.; Kratochvilova, K.; Leba, M.; Jindra, P.; Ostasov, P.; Maceckova, D.; Klieber, R.; Gmucova, H.; Sramek, J.; et al. Single-Nucleotide Polymorphisms in MICA and MICB Genes Could Play a Role in the Outcome in AML Patients after HSCT. J. Clin. Med. 2021, 10, 4636.
25. Siemaszko, J.; Dratwa, M.; Szeremet, A.; Majcherek, M.; Czyż, A.; Sobczyk-Kruszelnicka, M.; Fidyk, W.; Solarska, I.; Nasiłowska-Adamska, B.; Skowrońska, P.; et al. MICB Genetic Variants and Its Protein Soluble Level Are Associated with the Risk of Chronic GvHD and CMV Infection after Allogeneic HSCT. Arch. Immunol. Ther. Exp. 2024, 72, 12.
26. Siemaszko, J.; Łacina, P.; Szymczak, D.; Szeremet, A.; Majcherek, M.; Czyż, A.; Sobczyk-Kruszelnicka, M.; Fidyk, W.; Solarska, I.; Nasiłowska-Adamska, B.; et al. Soluble MICA concentrations and genetic variability of MICA and its NKG2D receptor as factors affecting Graft-versus-Host Disease development after allogeneic haematopoietic stem cell transplantation. Hum. Immunol. 2024, 85, 111147.
27. Pan, Q.; Ma, X.; You, Y.; Yu, Y.; Fan, S.; Wang, X.; Wang, M.; Gao, M.; Gong, G.; Miao, K.; et al. The impact of ageing on the distribution of preformed anti-HLA and anti-MICA antibody specificities in recipients from eastern China prior to initial HSCT. Immun. Ageing 2024, 21, 15.
28. Machuldova, A.; Holubova, M.; Caputo, V.S.; Cedikova, M.; Jindra, P.; Houdova, L.; Pitule, P. Role of Polymorphisms of NKG2D Receptor and Its Ligands in Acute Myeloid Leukemia and Human Stem Cell Transplantation. Front. Immunol. 2021, 12, 651751.
29. Brown, N.K.; Merkens, H.; Rozemuller, E.H.; Bell, D.; Bui, T.-M.; Kearns, J. Reduced PCR-generated errors from a hybrid capture-based NGS assay for HLA typing. Hum. Immunol. 2021, 82, 296–301.
30. Closa, L.; Vidal, F.; Herrero, M.J.; Caro, J.L. High-throughput genotyping of HLA-G, HLA-F, MICA, and MICB and analysis of frequency distributions in healthy blood donors from Catalonia. HLA 2021, 97, 420–427.
31. Weimer, E.T.; Newhall, K.A. Machine learning enhanced immunologic risk assessments for solid organ transplantation. Sci. Rep. 2025, 15, 7943.
32. Ramalhete, L.; Almeida, P.; Ferreira, R.; Abade, O.; Teixeira, C.; Araújo, R. Revolutionizing Kidney Transplantation: Connecting Machine Learning and Artificial Intelligence with Next-Generation Healthcare—From Algorithms to Allografts. BioMedInformatics 2024, 4, 673–689.
33. Alowidi, N.; Ali, R.; Sadaqah, M.; Naemi, F.M.A. Advancing Kidney Transplantation: A Machine Learning Approach to Enhance Donor–Recipient Matching. Diagnostics 2024, 14, 2119.
34. Vivek, K.; Papalois, V. AI and Machine Learning in Transplantation. Transplantology 2025, 6, 23.
35. Ramalhete, L.; Araújo, R.; Teixeira, C.; Teixeira, A.; Almeida, P.; Silva, I.; Lima, A. Evaluation of rapid optimized flow cytometry crossmatch (Halifaster) in living donor kidney transplantation. HLA 2024, 103, e15391.
36. Koskela, S.; Tammi, S.; Clancy, J.; Lucas, J.A.M.; Turner, T.R.; Hyvärinen, K.; Ritari, J.; Partanen, J. MICA and MICB allele assortment in Finland. HLA 2023, 102, 52–61.
37. Tammi, S.; Koskela, S.; Hyvärinen, K.; Partanen, J.; Ritari, J. Accurate multi-population imputation of MICA, MICB, HLA-E, HLA-F and HLA-G alleles from genome SNP data. PLoS Comput. Biol. 2024, 20, e1011718.
38. Le, D.T.; Huynh, T.R.; Burt, B.; Van Buren, G.; Abeynaike, S.A.; Zalfa, C.; Nikzad, R.; Kheradmand, F.; Tyner, J.J.; Paust, S. Natural killer cells and cytotoxic T lymphocytes are required to clear solid tumor in a patient-derived xenograft. JCI Insight 2021, 6, e140116.

Figure 1. Distribution of common MICA and MICB alleles in the study population. (A) Bar plot of the 15 most frequent MICA alleles, showing the number of observed allele copies in 1364 individuals. (B) Bar plot of the 15 most frequent MICB alleles, displayed in the same manner. Both panels illustrate the dominance of a limited set of high-frequency alleles and a long tail of less common variants.

Figure 2. Distribution of multilocus heterozygosity at MICA and MICB. Histogram of the number of heterozygous loci per individual, considering MICA and MICB jointly. Bars show the counts of individuals who are homozygous at both loci, heterozygous at a single locus or heterozygous at both loci. The distribution illustrates the high prevalence of multilocus heterozygosity in this population.

Figure 3. Co-occurrence of frequent MICA and MICB alleles. Heatmap of co-occurrence counts for the 15 most frequent MICA alleles (rows) and 15 most frequent MICB alleles (columns). The color scale reflects the number of individuals carrying each allele pair across the two loci. Non-randomly enriched combinations suggest underlying haplotypic structure in the MICA–MICB region.

Figure 4. LD-like r² between common MICA and MICB alleles. Heatmap of LD-like r² values for the ten most frequent MICA alleles (rows) and ten most frequent MICB alleles (columns). High r² values indicate strong non-random association between specific MICA–MICB allele pairs, consistent with conserved haplotypic blocks within the non-classical MHC region.

Figure 5. Silhouette-based assessment of the number of k-means clusters. Average silhouette coefficient for k-means clustering solutions with k from 2 to 7, based on a binary presence/absence representation of all observed MICA and MICB alleles. The curve shows a broad maximum around k = 3, supporting the use of a three-cluster solution as a balance between cluster separation and model simplicity.

Figure 6. Principal component analysis of multilocus MICA/MICB genotypes. Two-dimensional PCA plot of individuals encoded by MICA and MICB allele presence/absence. Each point represents one individual and is colored according to k-means cluster assignment (k = 3). Partially overlapping but structured clouds indicate continuous genetic variation organized around a few dominant MICA/MICB backgrounds.

Figure 7. t-SNE embedding of multilocus MICA/MICB genotypes colored by k-means clusters. Two-dimensional t-SNE map derived from the same binary genotype matrix used in PCA. Each point corresponds to an individual, colored by k-means cluster membership. The non-linear embedding emphasizes local neighborhood relationships and reveals visually distinct “clouds” corresponding to major MICA/MICB genetic backgrounds.

Figure 8. UMAP embedding of multilocus MICA/MICB genotypes colored by k-means clusters. Uniform manifold approximation and projection (UMAP) of individuals encoded by MICA and MICB allele presence/absence. As in Figure 6, each point represents an individual and is colored according to k-means cluster. The UMAP representation provides an alternative view of the same structure, with dense regions and smooth transitions between cluster-enriched areas.

Figure 9. Hierarchical clustering of individuals based on MICA/MICB allele presence. Dendrogram of individuals clustered by average-linkage hierarchical clustering using a Jaccard distance matrix derived from MICA and MICB allele presence/absence. The tree reveals gradual branching without clearly separated subtrees, consistent with continuous multilocus structure rather than sharply defined subpopulations.

Figure 10. Heatmap of pairwise similarity between individuals. Heatmap of 1 − Jaccard distance between individuals, ordered according to the dendrogram in Figure 9. Warmer colors correspond to higher similarity in MICA/MICB allele repertoires. Block-like patterns indicate sets of individuals sharing closely related multilocus backgrounds.

Figure 11. Presence/absence matrix of frequent MICA and MICB alleles across individuals. Bicluster-style heatmap displaying individuals (rows) and selected frequent MICA and MICB alleles (columns). Cells indicate presence (1) or absence (0) of each allele in each individual. Both rows and columns are ordered by hierarchical clustering, highlighting co-occurring alleles and groups of individuals with similar MICA/MICB profiles.

Figure 12. Jaccard similarity between frequent MICA and MICB alleles. Heatmap of pairwise Jaccard similarity indices among the ten most frequent MICA and ten most frequent MICB alleles, based on co-carriage in individuals. Rows and columns include both MICA and MICB alleles; higher values indicate alleles that tend to occur in the same individuals. The clustered structure highlights groups of alleles that participate in shared haplotypic backgrounds.
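
The allele-level Jaccard index described in this caption compares the sets of individuals carrying each allele. The sketch below assumes hypothetical carrier sets and a hypothetical helper `allele_jaccard`; the allele names are taken from Table 1, but the carriers are invented for illustration.

```python
from itertools import combinations

# Hypothetical carriers: the set of individuals carrying each allele.
carriers = {
    "MICA*009:02": {"ind1", "ind2", "ind3"},
    "MICB*005:06": {"ind1", "ind2", "ind4"},
    "MICA*002:01": {"ind4", "ind5"},
}

def allele_jaccard(carriers):
    """Jaccard index |A and B| / |A or B| for every unordered allele pair."""
    sim = {}
    for a, b in combinations(sorted(carriers), 2):
        union = carriers[a] | carriers[b]
        shared = carriers[a] & carriers[b]
        sim[(a, b)] = len(shared) / len(union) if union else 0.0
    return sim

similarity = allele_jaccard(carriers)
```

Because the index is computed on carriage rather than on phased haplotypes, a high value suggests, but does not prove, that two alleles sit on the same haplotype.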

Figure 13. t-SNE map of recurrent MICA–MICB genotype constellations with cluster regions. Two-dimensional t-SNE embedding of four-allele MICA–MICB genotype constellations observed in at least three individuals. Each point represents a distinct genotype constellation and is colored according to k-means cluster assignment in the t-SNE plane. Background decision regions, estimated by a k-nearest neighbors classifier, illustrate how clusters tile the haplotype-constellation space and differentiate common central constellations from rare peripheral ones.

Figure 14. Co-occurrence of common MICA and MICB genotypes. Heatmap of counts for the most frequent MICA genotypes (rows) and MICB genotypes (columns). Cell colors reflect the number of individuals with each MICA–MICB genotype combination. The matrix summarizes the dominant genotype backbones in the cohort and complements the constellation-based representation in Figure 13.
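
A genotype co-occurrence table of this kind amounts to counting (MICA genotype, MICB genotype) pairs across individuals. The following sketch uses three hypothetical individuals and a hypothetical `genotype_label` helper that makes the label independent of allele order.

```python
from collections import Counter

# Hypothetical per-individual genotypes: (MICA allele pair, MICB allele pair).
individuals = [
    (("MICA*002:01", "MICA*008:01"), ("MICB*005:02", "MICB*002:01")),
    (("MICA*008:01", "MICA*002:01"), ("MICB*002:01", "MICB*005:02")),
    (("MICA*004:01", "MICA*004:01"), ("MICB*005:02", "MICB*005:02")),
]

def genotype_label(pair):
    """Order-independent genotype label, e.g. 'A+B' for (B, A) or (A, B)."""
    return "+".join(sorted(pair))

# Count how many individuals carry each MICA-MICB genotype combination.
cooccurrence = Counter(
    (genotype_label(mica), genotype_label(micb)) for mica, micb in individuals
)
```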

Table 1. High-resolution MICA and MICB allele frequencies in 1364 individuals from southern Portugal. Allele counts and relative frequencies for all MICA and MICB alleles identified by next-generation sequencing in 1364 unrelated individuals. Each row corresponds to a distinct allele; columns indicate locus, allele name, number of observed allele copies and percentage of all alleles at that locus.

MICA Allele Frequencies			MICB Allele Frequencies
Allele	Count	Frequency (%)	Allele	Count	Frequency (%)
MICA*002:01	433	15.8724	MICB*005:02	1097	40.2126
MICA*008:01	433	15.8724	MICB*002:01	557	20.4179
MICA*004:01	365	13.3798	MICB*004:01	404	14.8094
MICA*008:04	255	9.3475	MICB*008:01	246	9.0176
MICA*009:01	212	7.7713	MICB*005:03	101	3.7023
MICA*009:02	129	4.7287	MICB*005:06	101	3.7023
MICA*011:01	114	4.1789	MICB*003:01	81	2.9692
MICA*016:01	113	4.1422	MICB*005:01	53	1.9428
MICA*018:01	88	3.2258	MICB*024:01	23	0.8431
MICA*001:01	68	2.4927	MICB*013:01	18	0.6598
MICA*010:01	66	2.4194	MICB*014:01	13	0.4765
MICA*049:01	55	2.0161	MICB*005:08	9	0.3299
MICA*019:01	55	2.0161	MICB*028	8	0.2933
MICA*017:01	55	2.0161	MICB*033	4	0.1466
MICA*007:01	53	1.9428	MICB*012	3	0.11
MICA*027:01	48	1.7595	MICB*009:01:01N	3	0.11
MICA*012:01	46	1.6862	MICB*021:01:01N	2	0.0733
MICA*015:01	25	0.9164	MICB*031:01	2	0.0733
MICA*011:01:04Q	25	0.9164	MICB*023	1	0.0367
MICA*008:13	11	0.4032	MICB*005:14	1	0.0367
MICA*006	10	0.3666	MICB*038	1	0.0367
MICA*008:02	9	0.3299
MICA*068:01	9	0.3299
MICA*041	7	0.2566
MICA*110	5	0.1833
MICA*101	4	0.1466
MICA*052	3	0.11
MICA*185	3	0.11
MICA*001:02	3	0.11
MICA*045:01	2	0.0733
MICA*012:03	2	0.0733
MICA*119:01	2	0.0733
MICA*153	2	0.0733
MICA*029:03	1	0.0367
MICA*046	1	0.0367
MICA*029:01	1	0.0367
MICA*022:01	1	0.0367
MICA*056	1	0.0367
MICA*007:03	1	0.0367
MICA*141	1	0.0367
MICA*030	1	0.0367
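
The percentages in Table 1 follow from dividing each allele count by the total number of allele copies at the locus, which is 2 × 1364 = 2728 for diploid individuals. A short check on three rows of the table (the function name `frequency_percent` is introduced here for illustration):

```python
N_INDIVIDUALS = 1364  # unrelated individuals, two allele copies each

# Allele copy counts transcribed from Table 1.
counts = {"MICA*002:01": 433, "MICA*004:01": 365, "MICB*005:02": 1097}

def frequency_percent(count, n_individuals=N_INDIVIDUALS):
    """Allele frequency as a percentage of all 2n allele copies at a locus."""
    return 100.0 * count / (2 * n_individuals)

freqs = {allele: frequency_percent(c) for allele, c in counts.items()}
```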

Table 2. Diversity metrics for MICA and MICB. Summary of locus-level diversity indices for MICA and MICB, including number of distinct alleles, observed heterozygosity (Ho), expected heterozygosity (He) and Shannon entropy. Values quantify overall allelic richness and evenness of the MICA and MICB allele distributions in the study population.

Locus	Individuals (n)	Alleles (n)	Ho	He	Shannon
MICA	1364	42	0.897	0.907	2.719
MICB	1364	22	0.742	0.762	1.795
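
As a worked check, expected heterozygosity and Shannon entropy can be recomputed from the MICB counts in Table 1 using He = 1 − Σ pᵢ² and H = −Σ pᵢ ln pᵢ. The sketch assumes the natural logarithm and no small-sample correction; neither choice is stated in this section, but under these assumptions the results land close to the MICB row of Table 2.

```python
import math

# MICB allele copy counts transcribed from Table 1 (they sum to 2 x 1364 = 2728).
micb_counts = [1097, 557, 404, 246, 101, 101, 81, 53, 23, 18, 13,
               9, 8, 4, 3, 3, 2, 2, 1, 1, 1]

def expected_heterozygosity(counts):
    """He = 1 - sum(p_i^2), with p_i the allele frequencies."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def shannon_entropy(counts):
    """H = -sum(p_i * ln p_i), in nats (natural log is an assumption here)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

he_micb = expected_heterozygosity(micb_counts)   # close to 0.762
shannon_micb = shannon_entropy(micb_counts)      # close to 1.79
```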

Table 3. Multilocus heterozygosity at MICA and MICB. Summary statistics of per-individual heterozygosity at MICA and MICB (1 = heterozygous, 0 = homozygous) and of the number of heterozygous loci per individual (0 = homozygous at both loci, 1 = heterozygous at one locus only, 2 = heterozygous at both loci) across the 1364 individuals. Column means give the proportion of heterozygous individuals at each locus.

	MICA_heterozygous	MICB_heterozygous	Total_heterozygous_loci
count	1364.0	1364.0	1364.0
mean	0.897	0.742	1.639
std	0.305	0.438	0.566
min	0.0	0.0	0.0
25%	1.0	0.0	1.0
50%	1.0	1.0	2.0
75%	1.0	1.0	2.0
max	1.0	1.0	2.0
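
The shares of individuals with zero, one or two heterozygous loci can be approximately recovered from the mean and standard deviation of the Total_heterozygous_loci column, because a variable that takes only the values 0, 1 and 2 is fully determined by its first two moments. The sketch below is an illustration on the rounded values in the table, so the resulting fractions are approximate and are not figures reported by the authors.

```python
# Rounded summary statistics from the Total_heterozygous_loci column.
mean_total = 1.639
std_total = 0.566

# For X in {0, 1, 2}:  E[X] = f1 + 2*f2  and  E[X^2] = f1 + 4*f2.
second_moment = std_total ** 2 + mean_total ** 2

frac_both = (second_moment - mean_total) / 2   # heterozygous at both loci
frac_one = mean_total - 2 * frac_both          # heterozygous at one locus only
frac_none = 1.0 - frac_one - frac_both         # homozygous at both loci
```

Under these assumptions roughly two thirds of individuals are heterozygous at both loci and only a few percent are homozygous at both.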

Table 4. Hardy–Weinberg equilibrium analysis for the most frequent MICA and MICB alleles. Goodness-of-fit tests for Hardy–Weinberg equilibrium for the major MICA and MICB alleles in the cohort. For each allele, the table reports the number of genotyped individuals, estimated allele frequency (p), observed and expected genotype counts, chi-square statistic and associated p value. Results provide a quality-control check for genotyping accuracy and gross deviations from equilibrium.

Locus	Major allele	n	p	χ²	p value
MICA	MICA*002:01	1364	0.159	2.398	0.302
MICB	MICB*005:02	1364	0.402	3.425	0.180
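
A goodness-of-fit test of this kind can be sketched for a single allele collapsed against all others. The genotype counts in the example are hypothetical, since the observed and expected counts are not shown in this section, and the helpers `hwe_chi_square` and `chi2_sf` are names introduced here. One observation from the numbers alone: the p values in Table 4 are numerically consistent with a chi-square tail probability on 2 degrees of freedom, whereas a biallelic Hardy–Weinberg test is conventionally evaluated on 1 degree of freedom. Which convention the authors applied is not stated here.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Return (allele frequency p, chi-square) for one allele vs. all others."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, chi2

def chi2_sf(x, df):
    """Upper-tail chi-square probability, closed forms for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or df = 2 are supported in this sketch")

# Hypothetical genotype counts, for illustration only.
p_hat, chi2_demo = hwe_chi_square(30, 40, 30)

# Tail probabilities for the chi-square statistics listed in Table 4.
p_mica_df2 = chi2_sf(2.398, 2)   # about 0.30
p_micb_df2 = chi2_sf(3.425, 2)   # about 0.18
p_mica_df1 = chi2_sf(2.398, 1)   # about 0.12
```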

Table 5. Top MICA and MICB alleles by k-means genotype cluster. Allele copy counts for the five most frequent MICA and MICB alleles within each of the three k-means clusters defined on the allele presence/absence matrix. The table highlights alleles that differentiate clusters and illustrates the partially overlapping multilocus genetic backgrounds.

Cluster_method	Cluster	Locus	Allele	Count
Cluster_kmeans	0	MICA	MICA*008:01	178
Cluster_kmeans	0	MICA	MICA*004:01	130
Cluster_kmeans	0	MICA	MICA*002:01	120
Cluster_kmeans	0	MICA	MICA*011:01	71
Cluster_kmeans	0	MICA	MICA*009:01	56
Cluster_kmeans	1	MICA	MICA*002:01	216
Cluster_kmeans	1	MICA	MICA*009:01	123
Cluster_kmeans	1	MICA	MICA*004:01	107
Cluster_kmeans	1	MICA	MICA*008:04	97
Cluster_kmeans	1	MICA	MICA*008:01	91
Cluster_kmeans	2	MICA	MICA*008:01	164
Cluster_kmeans	2	MICA	MICA*008:04	138
Cluster_kmeans	2	MICA	MICA*004:01	128
Cluster_kmeans	2	MICA	MICA*002:01	97
Cluster_kmeans	2	MICA	MICA*009:02	59
Cluster_kmeans	0	MICB	MICB*005:02	667
Cluster_kmeans	0	MICB	MICB*008:01	86
Cluster_kmeans	0	MICB	MICB*005:03	42
Cluster_kmeans	0	MICB	MICB*005:06	29
Cluster_kmeans	0	MICB	MICB*013:01	6
Cluster_kmeans	1	MICB	MICB*002:01	557
Cluster_kmeans	1	MICB	MICB*005:02	218
Cluster_kmeans	1	MICB	MICB*004:01	86
Cluster_kmeans	1	MICB	MICB*008:01	36
Cluster_kmeans	1	MICB	MICB*005:06	27
Cluster_kmeans	2	MICB	MICB*004:01	318
Cluster_kmeans	2	MICB	MICB*005:02	212
Cluster_kmeans	2	MICB	MICB*008:01	124
Cluster_kmeans	2	MICB	MICB*003:01	57
Cluster_kmeans	2	MICB	MICB*005:06	45


Share and Cite

MDPI and ACS Style

Ramalhete, L.; Almeida, P.; Araújo, R.; Espada, E. Integrative Population Analysis of MICA and MICB Using Unsupervised Machine Learning in a Large Histocompatibility Laboratory Cohort. J 2026, 9, 8. https://doi.org/10.3390/j9010008
