1. Introduction
Colorectal cancer (CRC) ranks among the leading malignancies worldwide in terms of both incidence and mortality, with over 1.9 million new cases and approximately 900,000 deaths annually [1]. Despite advances in screening and targeted therapies, the 5-year survival rate for metastatic colorectal cancer remains below 15% [2], underscoring the persistent need for optimized treatment strategies [3,4]. CRC exhibits pronounced molecular and biological heterogeneity, among which microsatellite instability (MSI) represents one of the most clinically significant molecular subtypes, exerting a profound impact on tumor immune characteristics and therapeutic responses [4,5]. MSI primarily arises from defects in the DNA mismatch repair (MMR) system, leading to compromised genomic stability. Accordingly, tumors with deficient mismatch repair are classified as dMMR, whereas those with intact MMR function and stable microsatellites are defined as proficient MMR (pMMR) or microsatellite stable (MSS) tumors [5,6,7,8]. Clinically, approximately 15% of sporadic colorectal cancers and the majority of Lynch syndrome-associated tumors display MSI features, while the remaining ~85% of cases are classified as MSS [5,9,10]. Accumulating evidence indicates that MSI colorectal cancers are typically associated with a more immunologically active tumor microenvironment, characterized by increased T-cell infiltration, elevated expression of immune-related genes, and enhanced immune activation signaling. These features have rendered MSI CRC one of the earliest colorectal cancer subtypes shown to exhibit a pronounced clinical response to immune checkpoint inhibitors, thereby driving the exploration and application of immunotherapy in this disease setting [11].
Therapies targeting the PD-1/PD-L1 pathway have achieved remarkable success in colorectal cancer with dMMR/MSI-H features, as well as in other solid tumors [12]. This therapeutic efficacy is commonly attributed to the high mutational burden of MSI-H tumors, which generates abundant neoantigens [5], thereby eliciting pre-existing effector T cell-mediated anti-tumor immune responses [13]. However, with deeper clinical experience, the simplistic paradigm that “MSI-H equals immunotherapy sensitivity” is increasingly challenged [12,14,15,16]. Studies have shown that not all MSI-H patients derive durable benefit from treatment, revealing significant response heterogeneity within this subgroup. Concurrently, encouraging treatment responses have been observed in a subset of MSS patients [17,18,19], with some cases even demonstrating superior efficacy compared to MSI-H patients [15]. As the core effector cells of immune checkpoint blockade (ICB), the functional state of T cells directly dictates the success of anti-tumor responses. The incomplete concordance between MSI status and clinical benefit indicates that MSI merely provides a potential for immune activation, whereas the true therapeutic bottleneck lies in whether T cells can effectively infiltrate tumors, undergo clonal expansion, and sustain their functional capacity within the tumor microenvironment [20]. These observations suggest that the key determinants of immunotherapy response may extend beyond MSI status itself and are more deeply rooted in the precise shaping and fate regulation of T cell clones by the tumor immune microenvironment. Notably, T cell-mediated antitumor immunity is not confined to the tumor site but instead operates within a dynamic, cross-tissue immune network. Previous studies have demonstrated that during immunotherapy for colorectal cancer, the expansion of intratumoral T-cell clones is often closely accompanied by synchronous changes in related clones in the peripheral blood, highlighting a critical role for cross-tissue clonal trafficking and selection in shaping therapeutic responses [21]. However, most existing studies focus on single tissue compartments, making it difficult to distinguish, at the clonal level, between local expansion, peripheral replenishment, and functional state transitions. This limitation hampers a systematic understanding of the dynamic mechanisms underlying responses to immunotherapy [22].
The T-cell receptor (TCR) serves as the key molecule enabling T cells to specifically recognize antigens and initiate adaptive immune responses. The diversity in its structure and function forms the core foundation of anti-tumor immunity. Composed of α and β chains (or γ and δ chains), TCRs generate enormous diversity through V(D)J recombination, thereby conferring upon T cells the ability to recognize a nearly unlimited array of antigens [23]. Within the tumor microenvironment (TME), TCRs recognize tumor antigen peptides presented by major histocompatibility complex (MHC) molecules, activating T cells and directing their cytotoxic functions [24]. Consequently, the TCR acts not only as the initiating hub of the immune response but also as a crucial bridge linking tumor antigens to T-cell effector functions. In recent years, advances in high-throughput sequencing technologies have established single-cell TCR sequencing (scTCR-seq) as a powerful tool for dissecting T-cell clonal composition, tracking the dynamics of specific clones, and assessing the quality of immune responses [23,24]. However, reliance on TCR sequence information alone remains insufficient to comprehensively characterize the functional states of T cells within the tumor microenvironment. An increasing body of evidence indicates that integrating TCR clonal information with single-cell transcriptomic data offers indispensable value for resolving the transcriptional features of the same clone across distinct differentiation stages, functional states, or immune stress conditions [25]. Nevertheless, in multi-tissue and longitudinal single-cell study settings, the stable and accurate reconstruction of full-length TCR sequences from transcriptomic data, together with their reliable pairing to cellular functional states, remains a major technical bottleneck limiting the widespread application of this strategy [26]. In this context, methods capable of directly reconstructing complete TCR sequences from single-cell RNA-sequencing data have emerged as one of the key avenues for enabling integrated analyses of TCR clonality and transcriptional states.
At present, multiple computational tools for TCR reconstruction from transcriptomic data have been widely applied, including DERR [27], Cell Ranger VDJ, MiXCR [28], and TRUST4 [29]. These methods have achieved substantial improvements in sensitivity and accuracy, enabling systematic analyses of T-cell clonality without the need for additional dedicated TCR sequencing. However, a considerable proportion of the TCR sequences assembled by current approaches are still classified as non-full-length chains, which imposes clear limitations in research contexts requiring precise α/β chain pairing, TCR structural modeling, or functional validation. For studies centered on integrated TCR-transcriptome analyses, this limitation to some extent constrains the depth of fine-grained clonal resolution and functional association in downstream analyses.
To address these research gaps, we integrated multi-tissue, longitudinal, and paired single-cell data from patients with MSI and MSS colorectal cancer before and after anti-PD-1 therapy, and established a systematic analytical framework. First, by optimizing the TCR reconstruction pipeline, we precisely characterized baseline TCR repertoire features to delineate fundamental differences between MSI and MSS tumors. Second, by analyzing patterns of clonotype sharing, we investigated treatment-driven T-cell selection dynamics. We then mapped clonal dynamics onto specific T-cell functional subsets to elucidate functional state transitions accompanying clonal expansion or contraction. Finally, we evaluated the translational potential of these findings based on key gene features. Overall, this study aims to dissect how microsatellite instability regulates the immune response in colorectal cancer at the clonal level, providing new insights for immune-biological stratification beyond traditional classification.
3. Discussion
This study adopts a cross-tissue, longitudinal single-cell integrative framework that jointly analyzes scRNA-seq and scTCR-seq data to comparatively examine the tissue distribution, clonal evolution, and functional fate of T-cell responses in microsatellite-stable (MSS) and microsatellite-instable (MSI) colorectal cancers. In contrast to previous studies that primarily focused on a single tissue compartment or a single time point, this framework emphasizes the coordinated assessment of clonal connectivity and functional remodeling across peripheral blood, adjacent normal tissue, and tumor tissue in a treatment-relevant context. This holistic approach enables the differentiation of distinct immunodynamic mechanisms, including local expansion, peripheral recruitment, and functional transition. Within this analytical framework, we are able to elevate the understanding of T-cell immune state differences from static descriptions at the “compositional level” to a dynamic comprehension of “clonal selection and functional fate,” thereby providing a more explanatory analytical perspective for deciphering the differences in immunotherapy responses under different microsatellite statuses.
Our study shows that MSS colorectal cancer (CRC) exhibits a “high baseline-strong suppression” immune phenotype. Adjacent non-tumor tissues are enriched in memory CD8+ T cells (e.g., c05_CD8_Trm and c02_CD8_Tem), indicating substantial immune reserves, which is consistent with the concept that peritumoral tissues serve as immune cell reservoirs [33]. However, the tumor microenvironment is characterized by widespread accumulation of immunosuppressive cytokines and T-cell dysfunction [34,35]. Our TCR clonality analysis further revealed a global contraction of the TCR repertoire, high conservation of shared clones, and, notably, a failure of effector clones to expand (indeed, they declined) following PD-1 blockade. These clonal-level features provide a mechanistic explanation for the limited efficacy of PD-1 monotherapy in patients with MSS CRC.
In contrast, MSI CRC, driven by high mutational burden and abundant neoantigen generation, elicits robust T-cell recruitment and clonal expansion and is therefore more responsive to PD-1/PD-L1 inhibitors [13], accompanied by pronounced T-cell functional remodeling and exhaustion programs [36]. In our study, MSI tumors displayed significantly increased TCR diversity and dramatic expansion of effector CD8+ T-cell populations (e.g., c02_CD8_Tem and c07_CD8_prolif_T), reflecting sustained and intense immune activation. At the same time, terminally exhausted T cells (c13_CD4_Tex and c09_CD8_Tex) and Treg cells (c11_CD4_Treg) were markedly expanded, with the magnitude of exhausted clone expansion exceeding that of effector clones. This “high-fluctuation-deep exhaustion” clonal trajectory suggests that, although the MSI microenvironment possesses strong activating capacity, it can rapidly drive newly generated effector T cells into irreversible dysfunction. This provides a plausible explanation for primary resistance to PD-1 inhibitors in a subset of MSI patients: once T cells have entered terminal exhaustion under multiple suppressive cues, blockade of the PD-1 pathway alone may be insufficient to restore function [37,38]. Recent multi-omics studies have further identified DUB-H and DUB-L subtypes within MSS CRC based on the expression of immune-related deubiquitinating enzymes (IR-DUBs). The DUB-L subtype is characterized by higher immune infiltration, stronger T-cell inflammatory signatures, and improved relapse-free survival, whereas high USP7 expression is associated with immune desert and immunosuppressive states [18]. Collectively, these results underscore the intricate and heterogeneous nature of the colorectal cancer immune microenvironment, which cannot be adequately captured by a binary MSS/MSI-based classification into “cold” or “hot” tumors. A more refined molecular and functional stratification is therefore necessary, and our analytical approach offers a robust and scalable framework to achieve this goal.
Based on these findings, we constructed a prognostically relevant T-cell functional-state signature and validated it in the TCGA cohort. This signature integrates multiple key genes involved in T-cell exhaustion and functional regulation, including markers reflecting cellular stress and dysfunction. Notably, NR4A1 was prominently upregulated in high-risk patients; this transcription factor has been well documented to drive T-cell exhaustion and impair effector function [39]. CXCR4 has been shown to promote metastasis and immunosuppression in colorectal cancer [40], while CCL4 can recruit regulatory T cells and myeloid-derived suppressor cells, contributing to the formation of an immunosuppressive tumor microenvironment [41]. The expression pattern of this signature is characterized by elevated exhaustion/stress markers together with suppressed memory and effector potential (e.g., reduced expression of IL7R and GZMK), and it outperformed MSI status alone in stratifying patient survival. Our model suggests that combinatorial interventions targeting exhaustion pathways (such as NR4A1), or strategies aimed at enhancing T-cell persistence to improve functional fitness, may help overcome immunotherapy resistance in MSI colorectal cancer.
Despite the high-resolution insights gained from our spatiotemporal integrative framework, this study has several limitations that warrant consideration. First, the primary discovery cohort consists of a relatively small number of patients (n = 6). Although we leveraged a dense sampling strategy, analyzing 43 longitudinal specimens across multiple tissue compartments, the limited number of patients per microsatellite subtype (MSI vs. MSS) may restrict the generalizability of certain rare T-cell clonal trajectories. We consider this work a pilot study that establishes a novel “clonal fate-centered” paradigm rather than a definitive clinical census. Second, while our findings on T-cell exhaustion and clonal contraction were robustly validated in the large-scale TCGA cohort, the lack of an independent, longitudinal single-cell validation set remains a constraint. Future studies involving larger multi-center cohorts will be essential to confirm these immunodynamic patterns across diverse treatment regimens. Nevertheless, our study provides a scalable analytical framework and offers a pioneering perspective on how clonal evolution, rather than static composition, dictates immunotherapy outcomes in colorectal cancer.
In summary, this study developed and applied TORBiT, an in-house toolkit for high-accuracy reconstruction of full-length TCRs from scRNA-seq data, and through integrative analysis of cross-tissue, longitudinal single-cell and TCR clonotype data, revealed fundamental differences in T-cell immune evolutionary trajectories between MSI and MSS colorectal cancers. MSS tumors display a “strongly suppressive” phenotype characterized by clonal contraction and functional silencing, whereas MSI tumors follow a “high-fluctuation, deep-exhaustion” trajectory in which intense immune activation develops in parallel with terminal exhaustion programs. Molecular features derived from T-cell functional states indicate that patient prognosis is primarily determined by T-cell functional quality rather than the magnitude of immune infiltration, suggesting that reliance on MSI status or conventional “hot/cold” classifications alone may be insufficient to capture the biological basis of immune responses. Overall, this study proposes a T-cell clonal fate-centered analytical paradigm, providing a scalable framework for dissecting immune heterogeneity in colorectal cancer at both single-cell and clonal levels.
4. Methods
4.1. Data Collection
To characterize the cellular composition and functional states in colorectal cancer under microsatellite-stable (MSS) and microsatellite-instable (MSI) conditions, data were obtained from the National Genomics Data Center (NGDC) under accession number GSA HRA005546. The author team obtained permission for in-depth analysis of these data through a collaborative agreement. Library construction and sequencing were performed by the original research group following the manufacturer’s standard protocols. Specifically, all samples were processed using the 10× Chromium platform (10× Genomics, Pleasanton, CA, USA) for both 3′ RNA-seq and 5′ V(D)J enrichment library preparation, and the purified libraries were sequenced on the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA) with 150 bp paired-end reads. Read alignment and initial expression matrix generation were performed by the original research group using the Cell Ranger single-cell toolkit (10× Genomics, Pleasanton, CA, USA; version 6.1.2) with the GRCh38 human reference genome. The raw single-cell RNA sequencing data described above are publicly available through the GEO database under accession number GSE236581. After standardized quality control, we retained a total of 175,930 high-quality cells for downstream analysis. Detailed sample information is provided in Supplementary Table S1.
To extend our findings from the single-cell level to a larger cohort and assess their clinical relevance, we further integrated bulk transcriptomic data from The Cancer Genome Atlas (TCGA) database (https://www.cancer.gov/ccg/research/genome-sequencing/tcga; accessed on 25 October 2025). Specifically, transcriptomic and clinical metadata from the TCGA-COAD and TCGA-READ projects were retrieved. A rigorous preprocessing pipeline was implemented to ensure data integrity: transcriptomic profiles were cross-matched with clinical metadata using unique sample identifiers, retaining only patients with both high-quality expression profiles and confirmed microsatellite instability (MSI) status. Based on the clinical MSI_group information, samples were stratified into MSI-high (MSI-H, n = 53) and microsatellite-stable (MSS, n = 255) groups, while patients with ambiguous or missing clinical labels were excluded. This resulted in a finalized validation cohort of 308 samples. For downstream analysis, raw expression counts were converted to Transcripts Per Million (TPM) and log-transformed (log2(TPM + 1)) to eliminate sequencing depth bias and optimize the data distribution for immune deconvolution and survival modeling.
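The count-to-TPM conversion and log transform described above can be sketched as follows. This is a minimal illustration for a single sample, not the authors' script: the function name `counts_to_log2_tpm` and the per-gene transcript lengths (`lengths_bp`) are assumptions, since the text does not state how gene lengths were obtained.

```python
import math


def counts_to_log2_tpm(counts, lengths_bp):
    """Convert raw counts of one sample to log2(TPM + 1).

    counts     -- raw read counts, one per gene
    lengths_bp -- assumed effective gene lengths in base pairs (same order)
    """
    # Reads per kilobase: normalise each gene by its length.
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    # Per-million scaling factor: normalise by the sample's total RPK.
    scale = sum(rpk) / 1e6
    if scale == 0:
        return [0.0 for _ in rpk]
    # TPM values sum to one million before the log transform.
    return [math.log2(r / scale + 1.0) for r in rpk]


if __name__ == "__main__":
    print(counts_to_log2_tpm([10, 20, 30], [1000, 2000, 3000]))
```

Because TPM values sum to one million within each sample, samples sequenced to different depths become comparable before the log transform compresses the dynamic range.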
4.2. scRNA-Seq Data Processing
scRNA-seq data were analyzed using the R package Seurat (version 4.4.1). Rigorous quality control was performed to filter out low-quality cells based on two criteria: the number of detected genes and the proportion of mitochondrial gene counts. Specifically, cells meeting either of the following conditions were removed: (1) fewer than 200 or more than 6000 detected genes, or (2) mitochondrial gene content exceeding 5%. The NormalizeData function was applied for library-size correction and log-transformation, and the resulting expression matrix was used for downstream analysis.
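The filtering rule above can be written as a single predicate. The sketch below is plain Python rather than Seurat code; the function and parameter names are illustrative assumptions, while the thresholds are those stated in the text.

```python
def passes_qc(n_genes, pct_mito, min_genes=200, max_genes=6000, max_pct_mito=5.0):
    """Return True if a cell survives the quality-control filter.

    A cell is removed when it has fewer than `min_genes` or more than
    `max_genes` detected genes, or when its mitochondrial fraction
    (in percent) exceeds `max_pct_mito`.
    """
    return min_genes <= n_genes <= max_genes and pct_mito <= max_pct_mito


if __name__ == "__main__":
    # Hypothetical cells: (detected genes, percent mitochondrial counts).
    cells = {"cell_a": (1500, 2.1), "cell_b": (150, 1.0), "cell_c": (3000, 7.5)}
    kept = [name for name, (genes, mito) in cells.items() if passes_qc(genes, mito)]
    print(kept)
```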
4.3. Integration, Unsupervised Dimensionality Reduction, Clustering, and Cell Type Identification of Single-Cell Sequencing Data
Subsequently, we adapted the Seurat workflow to perform dimensionality reduction and unsupervised clustering. First, 2000 highly variable genes (HVGs) were selected using the FindVariableFeatures function with the parameter selection.method = “vst”. Next, the effects of total UMI counts and mitochondrial gene percentage were regressed out from the HVG expression matrix using the ScaleData function. Dimensionality reduction was then performed on the scRNA-seq data via the RunPCA function. Because our samples were collected from blood, adjacent normal tissue, and tumor tissue at multiple time points before and after anti-PD-1 immunotherapy and were processed in batches, we applied RunHarmony from the Harmony package (version 1.2.4) to integrate the samples and remove batch effects. The number of principal components (PCs) to retain was chosen with the ElbowPlot function in Seurat, which ranks PCs by their standard deviation. Because the elbow was reached at the 30th principal component, the first 30 PCs were used for UMAP (Uniform Manifold Approximation and Projection) analysis via the RunUMAP function. Cells were then clustered using the FindClusters function with a resolution of 0.1, and the resulting clusters were annotated based on canonical marker genes. For T-cell subpopulation clustering, the resolution was increased to 2, while all other parameters remained unchanged.
4.4. T-Cell Subtype Enrichment and Expansion
To quantitatively assess T-cell dynamic changes induced by anti-PD-1 therapy under different microsatellite statuses in colorectal cancer, we analyzed the distribution differences in T-cell subtypes between the tumor microenvironment and peripheral normal tissues, as well as changes in clonal size before and after treatment. Calculations were based on single-cell data from each patient at baseline (pre-treatment) and the first post-treatment sampling.
Let T_pre denote the abundance of a given T-cell subtype in tumor tissue at the first sampling time point (pre-treatment), and N_pre denote the corresponding abundance in normal tissue at the same pre-treatment time point.
Tissue Enrichment Score:
This score quantifies the inherent distribution preference of a specific T-cell subtype between tumor and normal tissues. It is calculated as follows:
E > 0: indicates enrichment of the T-cell subtype in tumor tissue.
E < 0: indicates enrichment of the T-cell subtype in normal tissue.
Tumor Response Score:
This index measures the change in clonal size of a specific T-cell subtype after treatment relative to baseline. It is calculated as follows:
Δ > 0: indicates relative expansion of the T-cell subtype after treatment.
Δ < 0: indicates relative contraction of the T-cell subtype after treatment.
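The exact expressions for the two scores are not reproduced in this text, so the sketch below should be read only as one plausible form. It assumes, purely for illustration, that both scores are log2 ratios with a small pseudocount; this choice is an assumption that happens to match the sign conventions stated above (positive for tumor enrichment or post-treatment expansion, negative otherwise). The names `t_post` and `pseudocount` are likewise hypothetical.

```python
import math


def log2_ratio(numerator, denominator, pseudocount=1e-6):
    """log2 of a ratio, with an assumed pseudocount to avoid division by zero."""
    return math.log2((numerator + pseudocount) / (denominator + pseudocount))


def tissue_enrichment(t_pre, n_pre):
    """Assumed form of E: > 0 means tumor-enriched, < 0 means normal-enriched."""
    return log2_ratio(t_pre, n_pre)


def tumor_response(t_post, t_pre):
    """Assumed form of the response score: > 0 expansion, < 0 contraction."""
    return log2_ratio(t_post, t_pre)


if __name__ == "__main__":
    # Hypothetical subtype abundances (fractions of T cells).
    print(tissue_enrichment(t_pre=0.20, n_pre=0.10))
    print(tumor_response(t_post=0.05, t_pre=0.10))
```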
4.5. TCR Reconstruction Pipeline Design
For raw sequencing reads, alignment was first performed using BWA (version 0.7.18) [42] to filter reads originating from TCR regions. The alignment results were then converted to FASTQ format using samtools (version 1.17) [43]. For bulk sequencing data, the assembly module Trinity (version 2.1.1) [44] was directly invoked for batch processing. For single-cell sequencing data, a barcode-embedding strategy was introduced to enable single-cell-level parsing and demultiplexing, as follows: (1) Single-end strategy: Based on the characteristic that single-end reads share the same sequencing identifier, barcode information was embedded into the sequence ID, and clustering was performed according to the newly generated names to achieve single-cell-level splitting. (2) Paired-end strategy: For paired-end reads, barcode information was embedded into the IDs of both forward (F) and reverse (R) reads, followed by barcode-based clustering and splitting. To improve processing efficiency, multi-process parallelization was employed, and the assembled contigs were consolidated into a new FASTA file. Finally, functional annotation was carried out using the standalone annotation module of TRUST4 (version 1.1.5) [29] to extract complete TCR sequences. The code TORBiT is available at https://github.com/XieBioLab/TORBiT (accessed on 1 December 2025).
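The barcode-embedding idea can be illustrated with a short sketch. It is not taken from the TORBiT code base: the function names, the `|` separator, and the `barcode_of` lookup (read ID to cell barcode) are all hypothetical, and the sketch assumes the barcode of each read is already known.

```python
from collections import defaultdict


def embed_barcode(read_id, barcode, sep="|"):
    """Return a new read ID that carries the cell barcode as a prefix."""
    return f"{barcode}{sep}{read_id}"


def split_reads_by_cell(records, barcode_of, sep="|"):
    """Rename reads with their barcode, then group them by the renamed ID.

    records    -- iterable of (read_id, sequence, quality) tuples
    barcode_of -- assumed mapping from read ID to cell barcode
    """
    # Step 1: embed the barcode into every read ID that has one.
    renamed = [
        (embed_barcode(read_id, barcode_of[read_id], sep), seq, qual)
        for read_id, seq, qual in records
        if read_id in barcode_of
    ]
    # Step 2: cluster reads by the barcode parsed back out of the new ID.
    per_cell = defaultdict(list)
    for new_id, seq, qual in renamed:
        per_cell[new_id.split(sep, 1)[0]].append((new_id, seq, qual))
    return dict(per_cell)


if __name__ == "__main__":
    reads = [("r1", "ACGT", "IIII"), ("r2", "TTGA", "IIII"), ("r3", "GGCC", "IIII")]
    print(split_reads_by_cell(reads, {"r1": "AAAC", "r2": "TTTG", "r3": "AAAC"}))
```

For paired-end data, the same renaming would be applied to both mates so that forward and reverse reads land in the same per-cell group before assembly.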
4.6. TCR Reconstruction Pipeline Evaluation
TCR information for each cell was reconstructed separately from single-cell TCR sequencing (scTCR-seq) and single-cell RNA sequencing (scRNA-seq) data. We defined the number of TCR chains reconstructed from scTCR-seq data as the benchmark. When a TCR chain reconstructed from scRNA-seq data for a given cell was successfully matched to a corresponding chain in the scTCR-seq benchmark, it was considered a correctly reconstructed TCR chain for that cell; otherwise, it was classified as incorrect. The accuracy rate was calculated as the proportion of correctly reconstructed TCR chains relative to the total number of TCR chains: Accuracy rate = (Number of correctly reconstructed TCR chains/Total number of TCR chains) × 100%.
To evaluate the performance of our TCR identification pipeline, we conducted a comprehensive benchmark comparison against TRUST4, a widely used tool for TCR sequence analysis. The evaluation was performed on a single-cell RNA sequencing dataset containing 6,614,682 raw sequencing reads. Both tools processed the same input data, and their outputs were systematically compared using two key metrics: recall rate and precision rate. Recall rate (sensitivity) was calculated as the proportion of original reads successfully identified and clustered by each tool, defined as: Recall rate = (Number of identified reads/Total original reads) × 100%. Precision rate (positive predictive value) was calculated as the proportion of identified reads successfully annotated as genuine TCR sequences, evaluated for each TCR component (V, D, J genes and CDR3 region): Precision rate = (Number of true positive reads/Total identified reads) × 100%. For each tool, we quantified: (1) the number of clustered sequences after the initial alignment step, (2) the number of sequences that obtained successful TCR annotations, and (3) the annotation success rates for V, D, J genes and the CDR3 region. Special emphasis was placed on evaluating D-gene annotation capability.
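The per-cell matching used for the accuracy rate at the start of this subsection can be sketched as below. Two points are assumptions rather than statements from the text: chains are compared as (locus, CDR3 amino acid sequence) pairs, and the denominator is taken to be the number of chains in the scTCR-seq benchmark.

```python
def reconstruction_accuracy(rna_chains, tcr_benchmark):
    """Percentage of benchmark chains recovered from scRNA-seq, cell by cell.

    rna_chains    -- dict: cell barcode -> set of (locus, cdr3_aa) from scRNA-seq
    tcr_benchmark -- dict: cell barcode -> set of (locus, cdr3_aa) from scTCR-seq
    """
    correct = 0
    total = 0
    for barcode, reference in tcr_benchmark.items():
        total += len(reference)
        # A chain counts as correct only if it matches within the same cell.
        correct += len(rna_chains.get(barcode, set()) & reference)
    return 100.0 * correct / total if total else 0.0


if __name__ == "__main__":
    benchmark = {
        "cell1": {("TRA", "CAVRDSNYQLIW"), ("TRB", "CASSLGQGNTEAFF")},
        "cell2": {("TRB", "CASSPGTGELFF")},
    }
    from_rna = {
        "cell1": {("TRA", "CAVRDSNYQLIW")},
        "cell2": {("TRB", "CASSQGTGELFF")},
    }
    print(reconstruction_accuracy(from_rna, benchmark))
```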
4.7. TCR Chain Filtering Criteria
Contigs assembled by our pipeline were filtered to select T-cell receptor (TCR) sequences, with the integrity and conservation of the CDR3 region serving as the core filtering criteria. To ensure analytical accuracy and biological relevance, only TCR chains containing complete variable region gene segments were retained as valid sequences for downstream analyses. Specifically, for T-cell receptor α (TRA) and γ (TRG) chains, a CDR3 amino acid sequence was considered valid if it started with a cysteine (C) and ended with either tryptophan (W) or phenylalanine (F). This criterion captures the typical C…F/W terminal pattern of CDR3 regions in these subtypes, consistent with the conserved features encoded by their J-region genes. For the more stringently conserved T-cell receptor β (TRB) and δ (TRD) chains, the CDR3 sequence was required to begin with the highly conserved “CASS” motif (cysteine-alanine-serine-serine) and end with phenylalanine (F) [45,46,47,48].
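The CDR3 criteria above translate directly into a small validity check. The sketch is illustrative only; the function name and the locus labels passed as strings are assumptions, while the start and end residues follow the rules as stated.

```python
def is_valid_cdr3(chain, cdr3_aa):
    """Apply the chain-specific CDR3 start/end rules described in the text.

    chain   -- locus label: "TRA", "TRB", "TRG" or "TRD"
    cdr3_aa -- CDR3 amino acid sequence
    """
    if chain in ("TRA", "TRG"):
        # Must start with cysteine and end with tryptophan or phenylalanine.
        return cdr3_aa.startswith("C") and cdr3_aa.endswith(("W", "F"))
    if chain in ("TRB", "TRD"):
        # Must start with the CASS motif and end with phenylalanine.
        return cdr3_aa.startswith("CASS") and cdr3_aa.endswith("F")
    return False  # unknown locus labels are rejected


if __name__ == "__main__":
    for chain, cdr3 in [("TRA", "CAVRDSNYQLIW"), ("TRB", "CSARDRGYEQYF")]:
        print(chain, cdr3, is_valid_cdr3(chain, cdr3))
```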
4.8. Individual Clonotype Expansion Proportion
Data from the same tissue of the same patient at different time points were integrated. The total number of clonotypes across both time points was set as 100%, and the proportion of clonotypes at each time point was calculated separately to assess induced expansion or contraction following treatment. Overall changes in clonal composition were evaluated by comparing the distribution of clonal frequencies (Ri) across all clonotypes, and further quantified using statistical measures such as the Clonal Expansion Index (CEI):
Peripheral Blood Baseline TCR Repertoire Clonality and Diversity Assessment:
To quantitatively assess differences in the clonal structure of the peripheral blood TCR repertoire between MSI and MSS colorectal cancer patients before receiving anti-PD-1 therapy, we performed the following analyses on pre-treatment (baseline) peripheral blood samples: Full-length TCR sequences were reconstructed from scRNA-seq data of each patient’s pre-treatment peripheral blood sample, and information including clonal frequency (Clones), CDR3 amino acid sequence (CDR3.aa), and V/D/J gene usage was recorded for each TCR sequence. Data cleaning and format standardization were carried out using a custom R script (extended from the immunarch package) to ensure the completeness and reliability of the clonotype list for each sample.
Clonality Score Calculation:
The clonality score was defined as the cumulative frequency of the top 10% high-frequency clones (ranked by Clones) relative to the total number of clones in the sample: Clonality score = (Clones_(1) + … + Clones_(k))/(Clones_(1) + … + Clones_(n)),
where Clones_(i) denotes the frequency of the i-th most abundant clonotype, k = max(1,[0.1 × n]), and n denotes the number of unique clonotypes in the sample. This metric reflects the concentration of dominant clones within the TCR repertoire.
Diversity Score Calculation:
The diversity score was estimated as the ratio of clonotype richness (Unique Clones) to the total number of clones (Total Clones), i.e., Diversity score = Unique Clones/Total Clones.
A higher value of this ratio indicates a more even distribution of clonotypes within the TCR repertoire, reflecting greater diversity. Differences in clonality and diversity scores between the MSS and MSI groups were compared using independent two-sample t-tests, with a significance level set at p < 0.05. All calculations and visualizations were performed in the R environment (version 4.5.1), using ggplot2 for plotting and ggpubr for adding statistical annotations.
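The clonality and diversity scores defined in this subsection can be sketched in a few lines. The sketch is plain Python rather than the authors' R script, and it assumes that the bracket in k = max(1,[0.1 × n]) denotes rounding down; the function names are illustrative.

```python
def clonality_score(clone_counts):
    """Share of all clones contributed by the top 10% most abundant clonotypes.

    clone_counts -- one positive count (Clones) per unique clonotype
    """
    counts = sorted(clone_counts, reverse=True)
    n = len(counts)
    if n == 0:
        return 0.0
    k = max(1, n // 10)  # assumed: [0.1 * n] is rounded down
    return sum(counts[:k]) / sum(counts)


def diversity_score(clone_counts):
    """Unique clonotypes divided by the total number of clones."""
    total = sum(clone_counts)
    return len(clone_counts) / total if total else 0.0


if __name__ == "__main__":
    sample = [50, 20, 10, 5, 5, 3, 3, 2, 1, 1]  # hypothetical repertoire
    print(clonality_score(sample), diversity_score(sample))
```

A repertoire dominated by a few expanded clonotypes pushes the clonality score toward 1 and the diversity score toward 0, whereas a repertoire of singletons gives a diversity score of 1.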
4.9. Private and Shared Clonotypes
Clonotype sharing was defined as identity in both variable gene selection and CDR3 amino acid sequence for either the α- or β-chain. A clonotype present in only one patient was classified as private. If a clonotype was found in multiple patients within either the MSI or MSS group, it was termed intra-group shared; if it appeared across both MSI and MSS groups, it was designated inter-group shared. To visualize the tissue distribution of highly shared T-cell clonotypes, we constructed Sankey diagrams based on single-cell TCR sequencing data. First, clonotypes present in both normal tissue (Normal) and tumor tissue (Tumor) were selected from the complete clonotype dataset, and the total number of patients carrying each clonotype, as well as the patient count stratified by MSI status, were summarized. The top 50 clonotypes observed in the largest number of patients were selected for further analysis.
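The private / intra-group / inter-group definitions above can be sketched as a simple classification over patient sets. The data layout is an assumption made for illustration: each observation is a (patient, clonotype key) pair, where the key is a hypothetical tuple of chain, V gene, and CDR3 amino acid sequence.

```python
from collections import defaultdict


def classify_clonotypes(observations, group_of):
    """Label each clonotype as private, intra-group shared or inter-group shared.

    observations -- iterable of (patient_id, clonotype_key) pairs
    group_of     -- mapping from patient_id to "MSI" or "MSS"
    """
    carriers = defaultdict(set)
    for patient, clonotype in observations:
        carriers[clonotype].add(patient)

    labels = {}
    for clonotype, patients in carriers.items():
        groups = {group_of[p] for p in patients}
        if len(patients) == 1:
            labels[clonotype] = "private"
        elif len(groups) == 1:
            labels[clonotype] = "intra-group shared"
        else:
            labels[clonotype] = "inter-group shared"
    return labels


if __name__ == "__main__":
    groups = {"P1": "MSI", "P2": "MSI", "P3": "MSS"}
    obs = [
        ("P1", ("TRB", "TRBV7-9", "CASSLGQGNTEAFF")),
        ("P2", ("TRB", "TRBV7-9", "CASSLGQGNTEAFF")),
        ("P3", ("TRA", "TRAV12-1", "CAVRDSNYQLIW")),
    ]
    print(classify_clonotypes(obs, groups))
```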
4.10. Statistical Methods
Statistical analyses in this study were primarily performed using R (version 4.3.1) and associated packages. A significance threshold of p < 0.05 was applied for all statistical tests, and multiple-comparison corrections were conducted using the false discovery rate (FDR) method.
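The text does not name the specific FDR procedure. As an illustration only, the sketch below assumes the Benjamini-Hochberg step-up adjustment, which is the most common choice; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A comparison would then be called significant when its adjusted value falls below the 0.05 threshold stated above.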
Comparison of continuous variables: For comparisons of continuous variables between two groups (such as immune cell proportions, TCR clonality scores, and diversity scores), independent two-sample t-tests were used when data met assumptions of normality and homogeneity of variances; otherwise, the Mann–Whitney U test (Wilcoxon rank-sum test) was applied. For comparisons among more than two groups, the Kruskal–Wallis test was employed. Groups shown in box plots (e.g., Figure 6A and Figure S5) were compared using the Wilcoxon rank-sum test.
Survival analysis: In the validation cohort, survival curves were plotted using the Kaplan–Meier method, and differences between groups were assessed with the log-rank test. The Cox proportional hazards model was used to calculate hazard ratios (HRs) and their 95% confidence intervals to evaluate the independent predictive value of the prognostic risk score.
Correlation analysis: Associations between gene expression and immune cell infiltration levels were evaluated using Spearman’s rank correlation analysis.
Significance levels in figures are denoted as follows: * p < 0.05, ** p < 0.01, *** p < 0.001. NS indicates no statistical significance.
4.11. Prognostic Risk Model Construction
To validate the prognostic value of T-cell functional states in an independent cohort, we constructed a multi-gene risk-scoring model based on 11 T-cell function-related signature genes identified from single-cell analysis (NR4A1, GZMK, HSPA1A, OASL, CXCR4, CCL4, ENC1, DUSP2, IL7R, PRKCQ-AS1, MATK). Using the GEPIA2 online platform, the model was built separately for MSI-H and MSI-L patient subgroups within the TCGA colorectal cancer cohort. Patients were dichotomized into high-risk and low-risk groups based on the median risk score. Prognostic performance was evaluated using Kaplan–Meier survival analysis and Cox regression embedded in the platform. The model demonstrated significant survival discrimination in the MSI-H subgroup (Log-rank p = 0.016).