Peripheral Blood TCR Clonotype Diversity as a Biomarker for Colorectal Cancer

Zhu, Gaochen; Chen, Tao; Ma, Chen; Liu, Kai; Huang, Bihui; Yang, Guan

doi:10.3390/bioengineering12111215

by Gaochen Zhu ¹, Tao Chen ², Chen Ma ³, Kai Liu ¹, Bihui Huang ⁴ and Guan Yang ¹,*

¹ Department of Infectious Diseases and Public Health, City University of Hong Kong, Kowloon, Hong Kong SAR 999077, China

² Department of General Surgery, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200080, China

³ Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR 999077, China

⁴ Scientific Research Center, The Seventh Affiliated Hospital, Sun Yat-sen University, Shenzhen 518107, China

* Author to whom correspondence should be addressed.

Bioengineering 2025, 12(11), 1215; https://doi.org/10.3390/bioengineering12111215

Submission received: 15 October 2025 / Revised: 3 November 2025 / Accepted: 5 November 2025 / Published: 7 November 2025

(This article belongs to the Special Issue Computer Vision and Machine Learning in Medical Applications, 2nd Edition)

Abstract

There is an urgent need to improve colorectal cancer (CRC) diagnosis because of the limitations of current diagnostic approaches. Systematic characterization of the human T cell receptor (TCR) repertoire, coupled with advanced computational methods, provides a promising opportunity to develop more accurate and less invasive diagnostic strategies for this major malignancy. The main objective of this work is to establish a TCR repertoire-based diagnostic model for CRC using machine learning algorithms and to identify the features that contribute most to accurate diagnosis. In a comparative analysis of ten machine learning and deep learning algorithms, the Transformer model performed best. The trained model achieved an area under the receiver operating characteristic curve (AUC) of 0.973 in predicting disease status in the internal test set and an AUC of 0.814 in the independent test set. Notably, we identified a panel of 50 TCR CDR3 sequence features that alone yielded a diagnostic AUC of 0.869 in the internal test set. Together, this TCR repertoire-based disease model demonstrates significant potential for clinical applications in CRC diagnosis and treatment response monitoring. Furthermore, similar diagnostic models could be established for other immune-related diseases based on disease-specific TCR repertoire data.

Keywords: TCR; machine learning; colorectal cancer; diagnosis

1. Introduction

Colorectal cancer (CRC) ranks among the top three most commonly diagnosed cancers globally and continues to be a significant contributor to cancer-related deaths [1]. Despite decades of cancer research, the fundamental mechanisms responsible for the initiation and progression of this ubiquitous disease have not been fully elucidated [2]. Early diagnosis of CRC is crucial for ensuring appropriate treatment, as the cancer can metastasize to other organs, particularly the liver and lungs, leading to secondary tumors and associated complications [3]. However, current diagnostic approaches such as colonoscopy, fecal occult blood test, and imaging technology have notable limitations, such as the risk of perforation, false-positive results, and missed non-bleeding lesions [4]. Therefore, there is an urgent need to explore and develop effective diagnostic approaches for CRC.

T lymphocytes play a crucial role in the immune response against tumors by recognizing tumor-associated antigens through T cell receptors (TCRs). TCR specificity and diversity are derived from the variable complementarity determining region 3 (CDR3) and the random rearrangement and mutations of Variable (V), Diversity (D), and Joining (J) regions. Numerous studies have highlighted the significance of TCR and CDR3 diversity in cancer diagnosis, therapy, and prognosis [5,6,7]. Thus, peripheral blood TCR repertoire profiling is emerging as a promising diagnostic tool for CRC, reflecting dynamic changes in the TCR repertoire that could serve as biomarkers for monitoring immunomodulatory processes.

Machine learning (ML), an advanced artificial intelligence technique for analyzing complex data, has proven successful in diagnosing and predicting various diseases, including cardiovascular diseases, cancer, and other immune-related disorders, for example by analyzing large volumes of microbiota data [8,9]. However, gut microbiota-based ML methods for cancer diagnosis have shown limitations, including insufficient biological evidence supporting microbiome-phenotype associations [10] and a high rate of misdiagnosis when applied to new populations [11]. In contrast, TCR-based ML methods can identify and predict tumor-associated antigens with high specificity, offering promising prospects for cancer diagnosis. In addition, this approach leverages the natural surveillance capabilities of the immune system against malignancies [12]. Therefore, investigating the diagnostic value of TCR repertoire-based ML in CRC represents a valuable research endeavor.

In recent years, several ML methods have been proposed to improve cancer prediction and diagnosis using TCR repertoire data. For instance, Ostmeyer et al. decomposed TCR sequences into 4-mer motifs and developed a logistic regression-based method for classifying tumor tissues from normal tissues [13]. Similarly, Beshnova et al. developed a convolutional neural network model called DeepCAT for cancer-associated TCR detection, which can further differentiate healthy individuals from cancer patients [14]. Despite these advancements, existing methods face challenges such as the need for large-scale labeled datasets, the complexity of deep learning models leading to interpretability issues, and potential overfitting due to high-dimensional TCR data [15,16]. Therefore, it is essential to develop alternative approaches that balance model complexity and interpretability while effectively utilizing TCR repertoire information for CRC diagnosis.

In this study, we propose an ML approach combined with feature selection techniques to address these challenges. We construct a diagnostic model that uses peripheral blood TCR repertoire profiling to diagnose CRC and to identify key biomarkers for CRC diagnosis. The findings are validated using publicly available TCR repertoire datasets, further strengthening the potential of this approach in cancer diagnostics. This approach could afford rapid and accurate classification from minimally invasive peripheral blood assays, providing biologically interpretable features and strong cross-cohort generalizability, thereby enabling safe, scalable deployment with reduced overfitting.

2. Materials and Methods

2.1. Data Collection

The TCR repertoire data were retrieved from the Sequence Read Archive database, available on the National Center for Biotechnology Information platform, and the Genome Sequence Archive in the National Genomics Data Center. A total of 220 CRC samples were collected from the following BioProjects: PRJNA754274 [17], 83 samples from 16 patients, with each patient having one or more samples sequenced; PRJCA009632 [18], 107 samples from 107 patients; and PRJNA1049886 [19], 30 samples from 30 patients. The 278 non-CRC samples were from the following BioProjects: PRJNA754274 [17], 20 samples from 20 healthy controls; PRJNA930724 [20], 22 samples from 22 healthy controls; PRJEB40492, 46 samples from 46 healthy controls; PRJEB50045 [21], 183 samples from 99 healthy individuals; and PRJNA821039 [22], 7 samples from 7 individuals. Apart from the four samples within BioProject PRJNA821039 that are from GATA2-deficient individuals, all remaining non-CRC samples are from healthy controls. All CRC patients included in this study were recruited in China. The control group consisted of individuals from China and several European countries, including the United Kingdom, Germany, and Switzerland. In total, the data cover 347 subjects recruited at multiple institutions, including the European Bioinformatics Institute and the Third Affiliated Hospital of Shandong First Medical University, China, among others. The demographics of the subjects are shown in Table S1. The discovery cohort included 123 CRC patients and 187 healthy individuals and was split into training and internal test sets at a 70/30 ratio. The independent validation cohort, comprising 30 CRC patients and 7 non-CRC individuals, was used for external verification.

2.2. Bioinformatics and Statistical Analyses for the TCR Repertoire

The raw sequence data underwent quality filtering using Trimmomatic (v.0.39) to eliminate adaptor sequences and low-quality reads (quality score < 30). MiXCR (v.4.5.0) was employed to align the sequencing data to the reference sequences for the V, D, J, and Constant (C) gene segments of the TCR [23]. Following alignment, the data were assembled to determine the specific gene region sequences on the TCRβ chain, particularly the CDR3. Batch effect adjustment was performed using ComBat-seq, a tool for untransformed raw count data [24], to harmonize TRBV and TRBJ gene usage across datasets generated by different labs. This methodology is well established in RNA-seq analyses, where it has been shown to reduce batch effects substantially [25,26]. Additionally, clone counts were normalized by taking the clone counts from one cohort as a baseline and scaling the clone counts in the other cohorts proportionally. This approach has been successfully utilized in previous TCR repertoire and single-cell RNA-seq studies to ensure comparability across samples [27,28]. Shannon diversity is computed as

$$H = -\sum_{i=1}^{N} p_{i} \log_{2} p_{i}, \tag{1}$$

where p_i is the frequency of sequence i in the repertoire and N is the total number of unique sequences. Clonality is then computed from H and N as

$$\text{Clonality} = 1 - \frac{H}{\log_{2} N}. \tag{2}$$
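As a worked illustration of Equations (1) and (2), the following minimal Python sketch computes both quantities from per-clonotype read counts. The function names and the toy counts are hypothetical, and the handling of a single-clone repertoire (where log2 N = 0) is a convention assumed here, not one stated in the text.

```python
import math

def shannon_diversity(clone_counts):
    """Shannon entropy H (base 2) from per-clonotype read counts, as in Eq. (1)."""
    total = sum(clone_counts)
    freqs = [c / total for c in clone_counts if c > 0]
    return -sum(p * math.log2(p) for p in freqs)

def clonality(clone_counts):
    """Clonality = 1 - H / log2(N), with N the number of unique clonotypes, as in Eq. (2)."""
    n_unique = sum(1 for c in clone_counts if c > 0)
    if n_unique < 2:
        # log2(1) = 0 makes the ratio undefined; treating a single-clone
        # repertoire as fully clonal is an assumption made for this sketch.
        return 1.0
    return 1.0 - shannon_diversity(clone_counts) / math.log2(n_unique)

# Toy repertoires (hypothetical counts): four equally abundant clones
# versus one dominant clone with three rare ones.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]

print(shannon_diversity(even), clonality(even))
print(shannon_diversity(skewed), clonality(skewed))
```

A perfectly even repertoire has maximal entropy and zero clonality, whereas a repertoire dominated by one clone moves toward a clonality of 1.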

All statistical analyses were conducted using R software (v.4.3.1). All the comparisons between the healthy control (HC) and CRC groups were performed using the Mann–Whitney U test with false discovery rate (FDR) correction for multiple testing. Statistical significance was set at three levels: * p < 0.05, ** p < 0.01, and *** p < 0.001.
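The group comparisons in this study were run in R. Purely as an illustration of the same procedure, the sketch below applies a two-sided Mann–Whitney U test per gene with SciPy and then adjusts the p-values with the Benjamini-Hochberg procedure, one common FDR correction (the text does not state which FDR method was used, so this choice is an assumption). The gene names and usage fractions are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity starting from the largest p-value, then cap at 1.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Hypothetical per-sample usage fractions for two genes in each group.
hc = {"geneA": [0.10, 0.12, 0.11, 0.13, 0.14, 0.15],
      "geneB": [0.05, 0.07, 0.06, 0.08, 0.05, 0.09]}
crc = {"geneA": [0.02, 0.03, 0.04, 0.01, 0.05, 0.06],
       "geneB": [0.06, 0.05, 0.08, 0.07, 0.09, 0.05]}

genes = sorted(hc)
raw_p = [mannwhitneyu(hc[g], crc[g], alternative="two-sided").pvalue for g in genes]
adj_p = benjamini_hochberg(raw_p)

for gene, p_raw, p_adj in zip(genes, raw_p, adj_p):
    print(gene, p_raw, p_adj)
```

Correcting across all tested genes at once, rather than per gene, is what keeps the expected share of false discoveries below the chosen threshold.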

Data visualization was performed using R software. Additionally, GraphPad Prism software (v9.0.0) was used for graphical visualizations.

2.3. Supervised Machine Learning

The data processing and analysis workflow is presented in Figure 1. To classify CRC versus non-CRC samples, seven supervised ML algorithms (Random Forest (RF), Logistic Regression (LR), K-Nearest Neighbors (KNN), Decision Tree (DT), Naïve Bayes, Gradient Boosting, and Adaptive Boosting (AdaBoost)) and three deep learning (DL) algorithms (multilayer perceptron (MLP), convolutional neural network (CNN), and Transformer) were trained on features of the TCR repertoire. CRC samples were labeled as positive and non-CRC samples as negative.

Among these algorithms, RF is a powerful ensemble learning method employed for classification tasks [29]. It builds multiple decision trees during training, with each tree constructed on a bootstrap sample drawn from the training set. For classification tasks, the output of RF is the class selected by the most trees. LR is a statistical method used for binary classification, modeling the relationship between dependent and independent variables to predict the probability of a binary outcome [30]. KNN is a non-parametric method used for classification and regression, which assigns a query point to the majority class among its k nearest neighbors in the feature space [31]. DT is a powerful and widely used classification model in data mining and machine learning, which efficiently partitions data by recursively testing numeric features against threshold values [32]. Naïve Bayes classification is a probabilistic model based on Bayes’ theorem, assuming independence among predictors, which simplifies the computation of conditional probabilities [33]. Gradient Boosting is a machine learning technique that constructs an additive model in a forward stage-wise fashion. It iteratively trains a series of weak learners, typically decision trees, where each subsequent model focuses on fitting the residual errors of the preceding model [34]. AdaBoost is an iterative algorithm that combines multiple weak classifiers into a strong classifier by adjusting the weights of training samples in each round, focusing on samples misclassified in the previous round and thereby improving the classification performance of the model [35]. MLP is a feed-forward neural network with multiple fully connected layers and nonlinear activations, capable of modeling complex, nonlinear relationships in high-dimensional data [36]. CNN is a specialized neural network architecture that employs convolutional layers to automatically extract hierarchical features from sequential or structured data, followed by pooling and fully connected layers for classification [37]. Transformer is an attention-based model that leverages self-attention mechanisms to process sequential data in parallel, excelling in capturing long-range dependencies without reliance on recurrence or convolution [38,39]. The Transformer algorithm’s robustness against overfitting in high-dimensional spaces, its ability to model positional encodings for sequence data, and its capacity to provide interpretable attention weights make it particularly suitable for TCR repertoire analysis.

We analyzed the CDR3 amino acid sequences from each sample, using their normalized frequencies as quantitative features. All unique CDR3 sequences across samples formed the feature set, with the top 1000 most frequent clones selected globally and further reduced to 500 via ANOVA F-value feature selection. Each sample was transformed into a feature vector where each element corresponds to the fraction of a specific CDR3 sequence. Missing values (i.e., sequences not present in a sample) were set to zero. The final input was a feature matrix X (n × m), where n is the number of samples, and m is the number of selected unique CDR3 sequences identified across all samples. The label y was assigned as ‘CRC’ for colorectal cancer samples and ‘HC’ for healthy controls.
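The construction of the feature matrix can be sketched as follows, assuming pandas and scikit-learn. The CDR3 sequences, fractions, and sample names are made up for illustration, the helper `build_feature_matrix` is hypothetical, and ranking clones "globally" by their summed fraction across samples is one possible reading of the text rather than the confirmed procedure. The study keeps 1000 clones and then 500 features; the toy example keeps 3 and then 2.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical per-sample clonotype tables: {sample: {CDR3 sequence: clone fraction}}.
samples = {
    "crc_1": {"CASSLGQYF": 0.40, "CASSPDRGYF": 0.35, "CASRTGEQYF": 0.25},
    "crc_2": {"CASSLGQYF": 0.50, "CASSPDRGYF": 0.30, "CASSQETQYF": 0.20},
    "hc_1": {"CASSQETQYF": 0.45, "CASRTGEQYF": 0.40, "CASSLGQYF": 0.15},
    "hc_2": {"CASSQETQYF": 0.55, "CASRTGEQYF": 0.35, "CASSPDRGYF": 0.10},
}
labels = pd.Series({"crc_1": 1, "crc_2": 1, "hc_1": 0, "hc_2": 0})

def build_feature_matrix(samples, top_n):
    """Rows are samples, columns are CDR3 sequences, values are clone fractions."""
    matrix = pd.DataFrame.from_dict(samples, orient="index").fillna(0.0)
    # Assumption: "most frequent globally" means the largest summed fraction.
    top = matrix.sum(axis=0).sort_values(ascending=False).index[:top_n]
    return matrix[top]

X = build_feature_matrix(samples, top_n=3)
y = labels.loc[X.index]  # align labels with the row order of X

# ANOVA F-value feature selection down to k features.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = list(X.columns[selector.get_support()])
print(X.shape, selected)
```

As a general precaution, fitting the selector on the training portion only and then applying it to the test data avoids leaking label information into the evaluation.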

Categorical labels were encoded into binary values (1 for ‘CRC’ and 0 for ‘HC’) using scikit-learn’s LabelEncoder. The discovery cohort was then split into training and internal test sets using a stratified 70/30 split, which preserves the class proportions in both sets.
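A minimal sketch of the encoding and stratified split is given below. Note that scikit-learn's `LabelEncoder` assigns integer codes in sorted label order, so with its defaults ‘CRC’ would receive 0 and ‘HC’ 1; to obtain the 1/0 coding described above, the sketch uses an explicit mapping instead. The class counts are taken from the discovery cohort description, while the feature values, `random_state`, and number of columns are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Placeholder labels sized like the discovery cohort (123 CRC, 187 HC).
labels = np.array(["CRC"] * 123 + ["HC"] * 187)

# LabelEncoder sorts the classes, so its default codes are CRC -> 0, HC -> 1.
default_codes = LabelEncoder().fit(labels).transform(["CRC", "HC"])

# Explicit mapping to the coding described in the text: CRC -> 1, HC -> 0.
y = (labels == "CRC").astype(int)

# Random placeholder features standing in for the CDR3 fraction matrix.
rng = np.random.default_rng(0)
X = rng.random((labels.size, 5))

# Stratified 70/30 split that keeps the CRC/HC proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(default_codes, y_train.mean(), y_test.mean())
```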

We employed stratified k-fold cross-validation with five folds (n_splits = 5), ensuring that each fold had a similar class distribution. For each algorithm, hyperparameter tuning was performed to optimize model performance. We defined parameter grids and utilized scikit-learn’s GridSearchCV for exhaustive search over specified parameter values. The models and their associated hyperparameters are as follows:

Random Forest: Number of estimators (n_estimators), maximum depth (max_depth), maximum features (max_features), and class weights (class_weight).

Logistic Regression: Regularization strength (C), penalty (penalty), and solver (solver).

K-Nearest Neighbors: Number of neighbors (n_neighbors) and weighting function (weights).

Naïve Bayes: Since the Gaussian Naïve Bayes classifier does not have hyperparameters that require tuning, it was used with default settings.

Decision Tree: Maximum depth (max_depth), minimum samples required to split (min_samples_split), and class weights (class_weight).

Gradient Boosting: Number of estimators (n_estimators), learning rate (learning_rate), and maximum depth (max_depth).

AdaBoost: Number of estimators (n_estimators) and learning rate (learning_rate).

Multilayer Perceptron: Hidden layer sizes (hidden_sizes), dropout rate (dropout), learning rate (learning_rate), number of epochs (epochs), and batch size (batch_size).

Convolutional Neural Network: Number of convolutional layers (conv_layers), filters per layer (filters), kernel size (kernel_size), pooling type (pooling), dropout rate (dropout), learning rate (learning_rate), number of epochs (epochs), and batch size (batch_size).

Transformer: Embedding dimension (d_model), number of attention heads (nhead), number of encoder layers (num_layers), dropout rate (dropout), learning rate (learning_rate), number of epochs (epochs), and batch size (batch_size).

More detailed model parameters are listed in Table S2. Models were trained on the training folds and validated on the validation fold within the cross-validation framework. The hyperparameters yielding the highest mean AUC score during cross-validation were selected for each model. Accuracy, area under the receiver operating characteristic curve (AUC), recall, and F1-score were used to measure model performance. These are well-known evaluation metrics and have been used in previous bioinformatics studies [40,41].
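The tuning procedure described above can be sketched for one of the classical models as follows. This is an illustration only: the data are synthetic, the grid values are placeholders rather than the values in Table S2, and `shuffle=True` with a fixed `random_state` is an assumption made for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the CDR3 feature matrix and CRC/HC labels.
X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           random_state=0)

# Five stratified folds, as in the text.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Illustrative grid over the Random Forest hyperparameters named above.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "class_weight": [None, "balanced"],
}

# Exhaustive search; the configuration with the highest mean AUC is kept.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds the model refit on all training data with the selected configuration, which is the object one would then evaluate on a held-out test set.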

After hyperparameter tuning and model training, we evaluated the models on the internal test set. We computed the mean and standard deviation of accuracy and AUC across the five folds to assess the stability and generalization capability of each model. The best-performing model from the cross-validation (based on AUC) was evaluated on the held-out test set to assess its predictive performance on unseen data. For each model, we generated the receiver operating characteristic (ROC) curves by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings. Finally, we used the optimal model to verify the AUC on the independent external validation cohort.
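To make the evaluation metrics concrete, the sketch below computes them with scikit-learn on a small hand-made example. The labels and scores are hypothetical, and the 0.5 decision threshold is an assumption, since the text does not state the threshold used to turn scores into class predictions.

```python
from sklearn.metrics import (accuracy_score, f1_score, recall_score,
                             roc_auc_score, roc_curve)

# Hypothetical ground truth (1 = CRC, 0 = HC) and predicted CRC probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

# Threshold-free metric: area under the ROC curve.
auc = roc_auc_score(y_true, y_score)

# ROC curve points: false positive rate and true positive rate per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Threshold-dependent metrics at an assumed cut-off of 0.5.
y_pred = [int(s >= 0.5) for s in y_score]
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(auc, accuracy, recall, f1)
```

In this toy case 13 of the 16 CRC/HC score pairs are ranked correctly, which gives an AUC of 0.8125, and three of the four samples in each class are classified correctly at the 0.5 cut-off.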

The analysis was conducted using Python (v.3.6.13) with the following libraries: pandas (v.1.1.5), NumPy (v.1.19.2), scikit-learn (v.0.24.2), and Matplotlib (v.3.3.4). Additionally, torch (v.2.5.1) was used for implementing deep learning models.

2.4. Feature Importance Measures

To assess the impact of each feature on the classifier’s predictive performance, we employed permutation importance within a cross-validation framework. This model-agnostic method measures the decrease in model performance when a single feature’s values are randomly shuffled, providing an unbiased estimation of feature importance suitable for high-dimensional datasets.

We used a five-fold stratified cross-validation to ensure that each fold had an approximately equal proportion of CRC and healthy samples, which is crucial to prevent class imbalance from affecting the model evaluation and the permutation importance estimates. For each feature in the validation set, we performed random shuffling of its values to disrupt any relationship with the target variable. The decrease in model performance was computed by comparing the model’s performance on the original validation data versus the permuted data. The importance score for each feature was calculated as the mean decrease in performance over 5 shuffling iterations (n_repeats = 5), balancing reliability and computational cost. The importance scores from each fold were accumulated and averaged across all folds to obtain a stable estimate of each feature’s importance.
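The cross-validated permutation importance procedure can be sketched with scikit-learn as shown below. Several elements are assumptions made for illustration: a logistic regression stands in for the Transformer, the data are synthetic, and AUC is used as the performance measure whose decrease is recorded (the text does not name the metric used for this step).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data: with shuffle=False the two informative features are
# columns 0 and 1, and the remaining six columns are pure noise.
X, y = make_classification(n_samples=200, n_features=8, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=2.0, shuffle=False, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
importances = np.zeros(X.shape[1])

for train_idx, val_idx in cv.split(X, y):
    # Stand-in classifier; the study applies this step to its Transformer model.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # Shuffle each feature 5 times on the validation fold and record the
    # mean drop in AUC relative to the unshuffled validation data.
    result = permutation_importance(model, X[val_idx], y[val_idx],
                                    scoring="roc_auc", n_repeats=5,
                                    random_state=0)
    importances += result.importances_mean

# Average the per-fold importance scores across the five folds.
importances /= cv.get_n_splits()
ranking = np.argsort(importances)[::-1]
print(ranking, importances.round(3))
```

Because the shuffling is done on validation data only, the scores reflect how much the fitted model relies on each feature for held-out predictions rather than how well the feature was fit during training.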

3. Results

3.1. Comparison of TCR Repertoires Between CRC Patients and Healthy Controls

We reanalyzed the TCRβ repertoires in the peripheral blood of 187 healthy individuals and 123 CRC patients in the training and test sets [17,20]. Each sample exhibited distinct features in TCR diversity. Healthy individuals exhibited a significantly higher number of unique TCR clones than CRC patients (Figure 2A). TCR clones with a frequency above 0.5% of total reads in a sample were defined as high-expansion clones (HECs) [18]. The comparison of HEC numbers between the CRC and HC groups showed a significantly higher HEC ratio in the CRC group (Figure 2B). Additionally, healthy donors had relatively high TCR diversity and low TCR clonality (Figure 2C,D). The length distributions of TCRβ CDR3 differed between healthy individuals and CRC patients (Figure 2E). CDR3 sequences of 14 amino acids or fewer were more frequent in CRC patients than in healthy individuals, while CDR3 sequences longer than 14 amino acids were more frequent in healthy individuals. Subsequently, we eliminated the sequences that appeared in the healthy group and ranked the remaining CRC-specific CDR3 sequences by their number of occurrences. The top 30 CRC-specific CDR3 sequences are shown in the histogram (Figure 2F). Furthermore, significantly higher usages of TRBV1, TRBV4-3, TRBV5-2, TRBV5-3, TRBV5-7, TRBV6-1, TRBV6-8, TRBV6-9, TRBV7-1, TRBV7-3, TRBV7-5, TRBV12-1, TRBV12-2, TRBV21-1, and TRBV22-1 were observed in healthy donors compared to CRC patients among the functional human Vβ genes. Conversely, significantly higher usages of TRBV6-7, TRBV16, TRBV23-1, and TRBV30 were observed in CRC patients compared to healthy donors (Figure 2G, FDR-corrected Mann–Whitney U test, * p < 0.05, ** p < 0.01, *** p < 0.001). However, among the functional Jβ genes, no significant differences were observed between CRC patients and healthy individuals (Figure 2H, FDR-corrected Mann–Whitney U test). These results demonstrate the consistency of the TCR repertoire across different CRC datasets and further substantiate the potential of the TCR repertoire as a biomarker for diagnosing CRC.
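The high-expansion clone definition used above (frequency above 0.5% of total reads in a sample) can be illustrated with a short Python sketch. The function name, the sequences, and the read counts are hypothetical. Because the exact definition of the "HEC ratio" is not given in the text, the sketch reports both the number of HECs and the share of reads they occupy as two possible summaries.

```python
def hec_summary(clone_counts, threshold=0.005):
    """Return (number of HECs, fraction of total reads carried by HECs).

    clone_counts maps each CDR3 sequence to its read count in one sample.
    A clone is counted as high-expansion if its read fraction exceeds threshold.
    """
    total = sum(clone_counts.values())
    hec_fractions = [n / total for n in clone_counts.values()
                     if n / total > threshold]
    return len(hec_fractions), sum(hec_fractions)

# Hypothetical sample with 1000 reads: two dominant clones, one clone just
# above the 0.5% cut-off, one just below it, and 40 singleton clones.
sample = {"CASSLGQYF": 700, "CASSPDRGYF": 250, "CASRTGEQYF": 6, "CASSQETQYF": 4}
sample.update({f"singleton_{i}": 1 for i in range(40)})

n_hec, hec_read_fraction = hec_summary(sample)
print(n_hec, hec_read_fraction)
```

Here the clone with 6 of 1000 reads (0.6%) counts as an HEC while the clone with 4 reads (0.4%) does not, so the sample has three HECs.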

3.2. Performance of TCR Repertoire-Based ML Models for CRC Diagnosis

Given that TCR diversity is primarily determined by CDR3, our study focused on the usage of CDR3 sequences. We constructed CRC diagnostic models using CDR3 amino acid sequences and their corresponding fractions as the input features. To evaluate the effectiveness of various ML and DL algorithms in diagnosing CRC based on TCR repertoire data, we assessed the AUC of ten different models on the internal test set: RF, LR, KNN, DT, Naïve Bayes, Gradient Boosting, AdaBoost, MLP, CNN, and Transformer. For each algorithm, we searched across the parameter space specified in Table S2 to identify the optimal hyperparameter configuration. We then trained the model using the configuration that achieved the best performance under five-fold cross-validation and evaluated it on the test set. As illustrated in Figure 3A, the ROC curves obtained from the internal test set show that the Transformer model achieved the highest overall performance, with an AUC value of 0.973.

We further evaluated the accuracy, recall, and F1-score of these models on the internal test set (Figure 3B). The Transformer model again performed best, achieving an accuracy of 0.91, a recall of 0.82, and an F1-score of 0.89. For comparison, another artificial intelligence-assisted technique for identifying CRC through blood tests [42], which combines a binary SVM classifier at the first level with a one-class SVM classifier at the second level, achieved 60% sensitivity and 79% specificity, and Red Cell Distribution Width [43] showed a sensitivity of 84% and a specificity of 88% for right-sided colon cancer. Relative to these approaches, our model compares favorably overall.

In summary, these results demonstrate that TCR repertoire profiling, when analyzed through ML models, can serve as a highly accurate and robust diagnostic marker for CRC, presenting significant potential for clinical applications in early detection.

3.3. Identification of Key TCR Repertoire Biomarkers

To identify significant TCR repertoire biomarkers for CRC diagnosis, we applied permutation importance to the Transformer-based ML model to quantify the impact of individual TCR features on disease classification. Figure 3C and Table S3 show the top 50 TCR features ranked by their permutation importance scores. Among these, CASTSGSDTQYF (TRBV9/TRBJ2-3) showed the highest importance score, indicating its critical role in differentiating CRC from healthy controls. Additionally, these top 50 features yielded an AUC of 0.869 and an accuracy of 0.8273 for the diagnosis of CRC (Figure 3D).

We retrained the model using the top 30, top 40, and top 50 features and subsequently plotted the ROC curves on the internal test set (Figure S1A). We also compared the AUC on the external test set across these three models (Figure S1B). The model trained with the top 50 features achieved the highest AUC on both the internal and external test sets, with values of 0.869 and 0.752, respectively. This result underscores the importance of the 50 features identified for the model.

3.4. Validation of TCR Repertoire-Based Models for CRC Diagnosis in Independent Cohorts

To externally validate the diagnostic value and mitigate the risk of over-optimistic reporting of diagnostic accuracy, we assessed the Transformer model using all features and using the top 50 features in the independent test set. The Transformer model using all features showed an AUC of 0.814 (Figure 4A) and an accuracy of 0.8108, while the 50-marker model reached an AUC of 0.752 (Figure 4B) and an accuracy of 0.7568.

To further investigate the patterns of these key TCR features across individual samples, we employed an FDR-corrected Mann–Whitney U test to compare the CDR3-derived TRBV-TRBJ combination usage, which was normalized by ComBat-seq. In total, 37 of the top 50 TRBV-TRBJ combinations showed a significant difference between healthy individuals and CRC patients (Figure 4C), with FDR-corrected p < 0.05, indicating their potential association with CRC status. This analysis not only underscores the utility of TCR repertoire analysis for identifying potential diagnostic markers but also suggests that certain combinations of TCR gene segments may be intimately involved in the immune response to CRC.

These findings provide a valuable foundation for further investigations into the TCR repertoire in CRC and may contribute to the development of novel diagnostic and therapeutic strategies targeting specific TCR interactions.

4. Discussion

4.1. Peripheral Blood TCR Repertoire Profiling Enables Accurate, Non-Invasive CRC Diagnosis

In this study, we demonstrate that peripheral blood TCR repertoire profiling, integrated with machine learning, offers a highly accurate and non-invasive approach for CRC diagnosis. Our data suggest alterations in TCR diversity and clonality in the peripheral blood of CRC patients compared with healthy individuals, highlighting the potential of TCR signatures as biomarkers. Notably, we successfully identified a panel of 50 key TCR biomarkers that differentiate CRC patients from healthy individuals, with CASTSGSDTQYF (TRBV9/TRBJ2-3) emerging as the most discriminative feature based on permutation importance. Indeed, increasing evidence supports the use of deep sequencing-based TCR repertoires as biomarkers for immune response in cancer patients [44]. Emerging techniques utilizing TCR gene sequencing offer novel approaches to evaluate lymphocytic infiltration, providing valuable insights into both clonality and abundance [45]. High clonality suggests the presence of a few dominant clones, while low clonality indicates the existence of multiple clones with similar abundance. Our observation of elevated clonality in CRC patients (Figure 2D) suggests oligoclonal expansions potentially elicited by tumor neoantigens, a phenomenon observed across various malignancies [46,47]. The specific TRBV-TRBJ patterns we identified, such as TRBV18/TRBJ2-7, thus represent a discriminatory immune signature for CRC that warrants further mechanistic exploration.

4.2. Key TCR Repertoire Features Underpin Robust Performance

We found that CRC patients exhibit significant differences in the usage of specific Vβ gene segments in the TCRs compared to healthy individuals. Certain Vβ genes, such as TRBV6-7, TRBV16, TRBV23-1, and TRBV30, were overrepresented in CRC patients (Figure 2G). These findings suggest that the usage of certain TCR gene segments is associated with CRC, potentially reflecting an adaptive immune response to tumor-associated antigens. Similar patterns have been observed in other cancers, where particular TCR clonotypes are enriched in affected patients [48]. These data underscore the utility of TCR repertoire analysis in identifying potential diagnostic markers and providing insights into the immune mechanisms involved in CRC.

Our data further demonstrate the feasibility of utilizing the TCR repertoire-based model for CRC diagnosis. Notably, our focus on CDR3 sequence features, rather than solely on TRBV-TRBJ gene combinations, has led to enhanced model performance compared to previous methods. By leveraging these specific TCR clones, we have improved the predictive accuracy for CRC diagnosis, highlighting the potential of these features as robust disease biomarkers. Compared to established CRC diagnostics like colonoscopy (sensitivity ~95% for advanced lesions but <70% for early polyps [4]), our TCR-based model offers a non-invasive alternative with potential for early-stage detection via immune profiling, as evidenced by CDR3 sequences and differential TRBV-TRBJ usages in peripheral blood from datasets including pre-treatment CRC samples [17]. Prospective studies in early-stage cohorts are needed to further validate its clinical merit. Incorporating only the top 50 TCR features also yielded high diagnostic accuracy, suggesting that a refined biomarker panel could streamline clinical implementation. Importantly, several of these features overlap with signatures reported in other studies and in other cancers. For instance, the TRBV18/TRBJ2-7 combination corresponding to CASSPNNYEQYF in our panel has also been reported as a top contributor to CRC prediction models in lymph node metastasis studies [19]. Similarly, TRBV19/TRBJ1-2 combinations, which correspond to highly expanded clones such as CASKGVSNYGYTF and CASSASGTAYGYTF in our dataset, were among the most frequent in lymph nodes and peripheral blood mononuclear cells of papillary thyroid carcinoma patients [49]. Additionally, the TRBV7-2/TRBJ2-1 combination, which underlies CASSFAGTSGMNEQFF in our results, represents one of the most abundant segments identified in clear cell renal cell carcinoma [50]. These cross-cancer associations imply shared immunogenic motifs or convergent evolution in anti-tumor immunity, with implications for pan-cancer biomarkers and personalized therapies targeting conserved TCR-antigen interactions.

4.3. Clinical Implications, Limitations, and Future Directions

Although our findings advance TCR-based diagnostics, several limitations should be acknowledged. Firstly, the sample population primarily consisted of individuals from Asia (China) and Europe (the United Kingdom, Germany, and Switzerland). This geographic limitation may affect the generalizability of our findings to other populations with different genetic backgrounds and environmental exposures. Future studies should include more diverse populations to validate and extend our findings. Additionally, while next-generation sequencing costs have decreased to approximately $50 per sample, enabling scalability via cloud-based machine learning pipelines, integration into routine clinical workflows requires standardized protocols and automated bioinformatic tools to enhance efficiency and accessibility. Finally, regulatory approval for TCR profiling as a diagnostic assay remains a key challenge, necessitating prospective clinical trials to establish sensitivity, specificity, and cost-effectiveness in real-world settings. Future research should focus on multi-center collaborations spanning underrepresented regions, the development of universally accepted workflow standards, and the design of prospective validation studies in diverse healthcare contexts. Addressing these limitations through future research will be critical to translating this approach into clinical practice.

In conclusion, our data indicate that TCR diversity and clonality in peripheral blood can be used for CRC diagnosis. Our model can also be extended to the prediction and diagnosis of other immune-mediated diseases by analyzing disease-specific TCR repertoires and applying our CDR3 sequence-based ML approach.
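
Diversity and clonality are typically computed from per-clonotype read counts. The following is a minimal sketch that assumes the commonly used definitions, Shannon entropy H = -Σ p_i ln p_i and clonality = 1 - H / ln N for N distinct clonotypes. The exact formulas used in this study are those given in its Methods and may differ. The CDR3 strings are taken from the discussion above, while the counts and function names are invented for illustration.

```python
import math

def shannon_diversity(counts):
    """Shannon entropy (natural log) of the clonotype frequency distribution."""
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    return -sum((c / total) * math.log(c / total) for c in positive)

def clonality(counts):
    """1 minus normalized entropy: 0 for a perfectly even repertoire,
    approaching 1 when a few clones dominate."""
    positive = [c for c in counts if c > 0]
    if len(positive) < 2:
        return 1.0  # convention assumed here: fewer than two clonotypes counts as fully clonal
    return 1.0 - shannon_diversity(positive) / math.log(len(positive))

# Hypothetical read counts for four clonotypes in two samples.
even_repertoire = {
    "CASSPNNYEQYF": 25, "CASKGVSNYGYTF": 25,
    "CASSASGTAYGYTF": 25, "CASSFAGTSGMNEQFF": 25,
}
skewed_repertoire = {
    "CASSPNNYEQYF": 97, "CASKGVSNYGYTF": 1,
    "CASSASGTAYGYTF": 1, "CASSFAGTSGMNEQFF": 1,
}

for name, repertoire in [("even", even_repertoire), ("skewed", skewed_repertoire)]:
    counts = list(repertoire.values())
    print(name, round(shannon_diversity(counts), 3), round(clonality(counts), 3))
```

In this toy comparison the skewed sample has lower Shannon diversity and higher clonality than the even one, which is the kind of contrast that repertoire-level metrics are meant to capture.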

5. Patents

A US provisional patent (No. 63/625,326, Title: TCR repertoire-based machine learning for colorectal cancer diagnosis) has been filed based on this work. A China invention patent has also been filed based on this work (No. 202510107763.4, Title: TCR repertoire-based machine learning for colorectal cancer diagnosis).

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bioengineering12111215/s1, Figure S1: Performance of Transformer models employing different numbers of features in the test cohort; Table S1: Demographics of subjects recruited in this study; Table S2: Parameters employed to construct different ML models; Table S3: Top 50 TCR features ranked by permutation importance.

Author Contributions

Conceptualization, G.Z. and G.Y.; methodology, G.Z. and G.Y.; software, G.Z. and G.Y.; validation, G.Z. and G.Y.; formal analysis, G.Z.; investigation, G.Z. and G.Y.; resources, G.Z. and G.Y.; data curation, G.Z. and G.Y.; writing—original draft preparation, G.Z. and G.Y.; writing—review and editing, G.Z., T.C., C.M., K.L., B.H. and G.Y.; visualization, G.Z. and G.Y.; supervision, G.Y.; project administration, G.Z. and G.Y.; funding acquisition, G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute Digital Medicine grant at the City University of Hong Kong, grant number 9229501-13-YG. Open Access was made possible with partial support from the Open Access Publishing Fund of the City University of Hong Kong. The funding source had no role in the writing of the manuscript or the decision to submit it for publication.

Institutional Review Board Statement

The study was approved by the Human Subject Ethics Sub-Committee of the Jockey Club College of Veterinary Medicine and Life Sciences (jcc2324ay004), City University of Hong Kong.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in the National Center for Biotechnology Information platform [PRJNA754274, PRJCA009632, PRJEB40492, PRJEB50045, PRJNA821039].

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Rawla, P.; Sunkara, T.; Barsouk, A. Epidemiology of colorectal cancer: Incidence, mortality, survival, and risk factors. Gastroenterol. Rev. Przegląd Gastroenterol. 2019, 14, 89–103.
2. Bell, C.C.; Gilan, O. Principles and mechanisms of non-genetic resistance in cancer. Br. J. Cancer 2020, 122, 465–472.
3. Mármol, I.; Sánchez-de-Diego, C.; Pradilla Dieste, A.; Cerrada, E.; Rodriguez Yoldi, M.J. Colorectal carcinoma: A general overview and future perspectives in colorectal cancer. Int. J. Mol. Sci. 2017, 18, 197.
4. Lin, J.S.; Piper, M.A.; Perdue, L.A.; Rutter, C.M.; Webber, E.M.; O’Connor, E.; Smith, N.; Whitlock, E.P. Screening for colorectal cancer: Updated evidence report and systematic review for the US Preventive Services Task Force. JAMA 2016, 315, 2576–2594, Erratum in JAMA 2021, 325, 1978–1998.
5. Jia, Q.; Zhou, J.; Chen, G.; Shi, Y.; Yu, H.; Guan, P.; Lin, R.; Jiang, N.; Yu, P.; Li, Q.-J. Diversity index of mucosal resident T lymphocyte repertoire predicts clinical prognosis in gastric cancer. Oncoimmunology 2015, 4, e1001230.
6. Han, Y.; Liu, X.; Wang, Y.; Wu, X.; Guan, Y.; Li, H.; Chen, X.; Zhou, B.; Yuan, Q.; Ou, Y. Identification of characteristic TRB V usage in HBV-associated HCC by using differential expression profiling analysis. Oncoimmunology 2015, 4, e1021537.
7. Postow, M.A.; Manuel, M.; Wong, P.; Yuan, J.; Dong, Z.; Liu, C.; Perez, S.; Tanneau, I.; Noel, M.; Courtier, A. Peripheral T cell receptor diversity is associated with clinical outcomes following ipilimumab treatment in metastatic melanoma. J. Immunother. Cancer 2015, 3, 23.
8. Rahman, S.F.; Olm, M.R.; Morowitz, M.J.; Banfield, J.F. Machine learning leveraging genomes from metagenomes identifies influential antibiotic resistance genes in the infant gut microbiome. MSystems 2018, 3, e00123-17.
9. Cammarota, G.; Ianiro, G.; Ahern, A.; Carbone, C.; Temko, A.; Claesson, M.J.; Gasbarrini, A.; Tortora, G. Gut microbiome, big data and machine learning to promote precision medicine for cancer. Nat. Rev. Gastroenterol. Hepatol. 2020, 17, 635–648.
10. Su, Q.; Liu, Q.; Lau, R.I.; Zhang, J.; Xu, Z.; Yeoh, Y.K.; Leung, T.W.; Tang, W.; Zhang, L.; Liang, J.Q. Faecal microbiome-based machine learning for multi-class disease diagnosis. Nat. Commun. 2022, 13, 6818.
11. Manandhar, I.; Alimadadi, A.; Aryal, S.; Munroe, P.B.; Joe, B.; Cheng, X. Gut microbiome-based supervised machine learning for clinical diagnosis of inflammatory bowel diseases. Am. J. Physiol. Gastrointest. Liver Physiol. 2021, 320, G328–G337.
12. Katayama, Y.; Yokota, R.; Akiyama, T.; Kobayashi, T.J. Machine learning approaches to TCR repertoire analysis. Front. Immunol. 2022, 13, 858057.
13. Ostmeyer, J.; Christley, S.; Rounds, W.H.; Toby, I.; Greenberg, B.M.; Monson, N.L.; Cowell, L.G. Statistical classifiers for diagnosing disease from immune repertoires: A case study using multiple sclerosis. BMC Bioinform. 2017, 18, 401.
14. Beshnova, D.; Ye, J.; Onabolu, O.; Moon, B.; Zheng, W.; Fu, Y.-X.; Brugarolas, J.; Lea, J.; Li, B. De novo prediction of cancer-associated T cell receptors for noninvasive cancer detection. Sci. Transl. Med. 2020, 12, eaaz3738.
15. Xu, Y.; Qian, X.; Zhang, X.; Lai, X.; Liu, Y.; Wang, J. DeepLION: Deep multi-instance learning improves the prediction of cancer-associated T cell receptors for accurate cancer detection. Front. Genet. 2022, 13, 860510.
16. Sidhom, J.-W.; Oliveira, G.; Ross-MacDonald, P.; Wind-Rotolo, M.; Wu, C.J.; Pardoll, D.M.; Baras, A.S. Deep learning reveals predictive sequence concepts within immune repertoires to immunotherapy. Sci. Adv. 2022, 8, eabq5089.
17. Chen, Y.-T.; Hsu, H.-C.; Lee, Y.-S.; Liu, H.; Tan, B.C.-M.; Chin, C.-Y.; Chang, I.Y.-F.; Yang, C.-Y. Longitudinal high-throughput sequencing of the T-cell receptor repertoire reveals dynamic change and prognostic significance of peripheral blood TCR diversity in metastatic colorectal cancer during chemotherapy. Front. Immunol. 2022, 12, 743448.
18. Cao, Y.; Wang, J.; Hou, W.; Ding, Y.; Zhu, Y.; Zheng, J.; Huang, Q.; Cao, Z.; Xie, R.; Wei, Q. Colorectal cancer–associated T cell receptor repertoire abnormalities are linked to gut microbiome shifts and somatic cell mutations. Gut Microbes 2023, 15, 2263934.
19. Zhen, Y.N.; Wang, H.; Jiang, R.; Wang, F.; Chen, C.; Xu, Z.; Xiao, R. Characterization of the T-cell receptor repertoire associated with lymph node metastasis in colorectal cancer. Front. Oncol. 2024, 14, 1354533.
20. Malik, A.; Sayed, A.A.; Han, P.; Tan, M.M.; Watt, E.; Constantinescu-Bercu, A.; Cocker, A.T.; Khoder, A.; Saputil, R.C.; Thorley, E. The role of CD8+ T-cell clones in immune thrombocytopenia. Blood J. Am. Soc. Hematol. 2023, 141, 2417–2429.
21. Rosati, E.; Martini, G.R.; Pogorelyy, M.V.; Minervina, A.A.; Degenhardt, F.; Wendorff, M.; Sari, S.; Mayr, G.; Fazio, A.; Dowds, C.M. A novel unconventional T cell population enriched in Crohn’s disease. Gut 2022, 71, 2194–2204.
22. Von Niederhäusern, V.; Ghraichy, M.; Trück, J. Applicability of T cell receptor repertoire sequencing analysis to unbalanced clinical samples–comparing the T cell receptor repertoire of GATA2 deficient patients and healthy controls. Swiss Med. Wkly. 2023, 153, 40046.
23. Bolotin, D.A.; Poslavsky, S.; Mitrophanov, I.; Shugay, M.; Mamedov, I.Z.; Putintseva, E.V.; Chudakov, D.M. MiXCR: Software for comprehensive adaptive immunity profiling. Nat. Methods 2015, 12, 380–381.
24. Zhang, Y.; Parmigiani, G.; Johnson, W.E. ComBat-seq: Batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2020, 2, lqaa078.
25. Zhang, X. Highly effective batch effect correction method for RNA-seq count data. Comput. Struct. Biotechnol. J. 2025, 27, 58–64.
26. Zhang, K.; Erkan, E.P.; Jamalzadeh, S.; Dai, J.; Andersson, N.; Kaipio, K.; Lamminen, T.; Mansuri, N.; Huhtinen, K.; Carpén, O. Longitudinal single-cell RNA-seq analysis reveals stress-promoted chemoresistance in metastatic ovarian cancer. Sci. Adv. 2022, 8, eabm1831.
27. Gurun, B.; Horton, W.; Murugan, D.; Zhu, B.; Leyshock, P.; Kumar, S.; Byrne, K.T.; Vonderheide, R.H.; Margolin, A.A.; Mori, M. An open protocol for modeling T Cell Clonotype repertoires using TCRβ CDR3 sequences. BMC Genom. 2023, 24, 349.
28. Lun, A.T.; Calero-Nieto, F.J.; Haim-Vilmovsky, L.; Göttgens, B.; Marioni, J.C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 2017, 27, 1795–1806.
29. Qi, Y. Random forest for bioinformatics. In Ensemble Machine Learning: Methods and Applications; Springer: Berlin/Heidelberg, Germany, 2012; pp. 307–323.
30. Uddin, S.; Khan, A.; Hossain, M.E.; Moni, M.A. Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 2019, 19, 281.
31. Bansal, M.; Goyal, A.; Choudhary, A. A comparative analysis of K-nearest neighbor, genetic, support vector machine, decision tree, and long short term memory algorithms in machine learning. Decis. Anal. J. 2022, 3, 100071.
32. Charbuty, B.; Abdulazeez, A. Classification based on decision tree algorithm for machine learning. J. Appl. Sci. Technol. Trends 2021, 2, 20–28.
33. Boyko, N.; Boksho, K. Application of the naive bayesian classifier in work on sentimental analysis of medical data. In Proceedings of the IDDM’2020: 3rd International Conference on Informatics & Data-Driven Medicine, Växjö, Sweden, 19–21 November 2020.
34. Ning, Y.; Zhang, S.; Nie, X.; Li, G.; Zhao, G. Fall detection algorithm based on gradient boosting decision tree. In Proceedings of the 2019 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Dalian, China, 20–22 September 2019.
35. Ding, Y.; Zhu, H.; Chen, R.; Li, R. An efficient AdaBoost algorithm with the multiple thresholds classification. Appl. Sci. 2022, 12, 5872.
36. Wang, Z.; Wang, Y.; Xuan, J.; Dong, Y.; Bakay, M.; Feng, Y.; Clarke, R.; Hoffman, E.P. Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data. Bioinformatics 2006, 22, 755–761.
37. Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020, 53, 5455–5516.
38. Ghosh, N.; Santoni, D.; Saha, I.; Felici, G. A review on the applications of Transformer-based language models for nucleotide sequence analysis. Comput. Struct. Biotechnol. J. 2025, 27, 1244–1254.
39. Choi, S.R.; Lee, M. Transformer architecture and attention mechanisms in genome data analysis: A comprehensive review. Biology 2023, 12, 1033.
40. Kha, Q.-H.; Tran, T.-O.; Nguyen, V.-N.; Than, K.; Le, N.Q.K. An interpretable deep learning model for classifying adaptor protein complexes from sequence information. Methods 2022, 207, 90–96.
41. Kha, Q.-H.; Le, V.-H.; Hung, T.N.K.; Nguyen, N.T.K.; Le, N.Q.K. Development and validation of an explainable machine learning-based prediction model for drug–food interactions from chemical structures. Sensors 2023, 23, 3962.
42. Soares, F.; Becker, K.; Anzanello, M.J. A hierarchical classifier based on human blood plasma fluorescence for non-invasive colorectal cancer screening. Artif. Intell. Med. 2017, 82, 1–10.
43. Nogueira-Rodríguez, A.; Domínguez-Carbajales, R.; López-Fernández, H.; Iglesias, Á.; Cubiella, J.; Fdez-Riverola, F.; Reboiro-Jato, M.; Glez-Pena, D. Deep neural networks approaches for detecting and classifying colorectal polyps. Neurocomputing 2021, 423, 721–734.
44. Cui, J.-H.; Lin, K.-R.; Yuan, S.-H.; Jin, Y.-B.; Chen, X.-P.; Su, X.-K.; Jiang, J.; Pan, Y.-M.; Mao, S.-L.; Mao, X.-F. TCR repertoire as a novel indicator for immune monitoring and prognosis assessment of patients with cervical cancer. Front. Immunol. 2018, 9, 2729.
45. Sanz-Pamplona, R.; Melas, M.; Maoz, A.; Schmit, S.L.; Rennert, H.; Lejbkowicz, F.; Greenson, J.K.; Sanjuan, X.; Lopez-Zambrano, M.; Alonso, M.H. Lymphocytic infiltration in stage II microsatellite stable colorectal tumors: A retrospective prognosis biomarker analysis. PLoS Med. 2020, 17, e1003292.
46. Simnica, D.; Akyüz, N.; Schliffke, S.; Mohme, M.; Wenserski, L.V.; Mährle, T.; Fanchi, L.F.; Lamszus, K.; Binder, M. T cell receptor next-generation sequencing reveals cancer-associated repertoire metrics and reconstitution after chemotherapy in patients with hematological and solid tumors. Oncoimmunology 2019, 8, e1644110.
47. Borràs, D.M.; Verbandt, S.; Ausserhofer, M.; Sturm, G.; Lim, J.; Verge, G.A.; Vanmeerbeek, I.; Laureano, R.S.; Govaerts, J.; Sprooten, J. Single cell dynamics of tumor specificity vs bystander activity in CD8+ T cells define the diverse immune landscapes in colorectal cancer. Cell Discov. 2023, 9, 114.
48. Valpione, S.; Mundra, P.A.; Galvani, E.; Campana, L.G.; Lorigan, P.; De Rosa, F.; Gupta, A.; Weightman, J.; Mills, S.; Dhomen, N. The T cell receptor repertoire of tumor infiltrating T cells is predictive and prognostic for cancer survival. Nat. Commun. 2021, 12, 4098.
49. Wang, Y.; Liu, Y.; Chen, L.; Chen, Z.; Wang, X.; Jiang, R.; Zhao, K.; He, X. T cell receptor Beta-chain profiling of tumor tissue, peripheral blood and regional lymph nodes from patients with papillary thyroid carcinoma. Front. Immunol. 2021, 12, 595355.
50. Zhang, W.; Zhang, Q.; Zhu, C.; Shi, Z.; Shao, C.; Chen, Y.; Wang, N.; Jiang, Y.; Liang, Q.; Wang, K. The intrarenal landscape of T cell receptor repertoire in clear cell renal cell cancer. J. Transl. Med. 2022, 20, 558.

Figure 1. Schematic illustration of the pipeline in TCR repertoire analysis. Framework for sample collection, library establishment, TCR repertoire construction, CRC diagnostic model, and biological interpretation. All comparisons were performed using FDR-corrected Mann–Whitney U test, *** p < 0.001.

Figure 2. Comprehensive analysis of TCR repertoire characteristics and differential Vβ and Jβ gene usage in CRC patients and healthy controls (HC). (A) Comparison of unique clone numbers between HC and CRC patients. Each data point represents an individual sample. (B) Comparison of HEC number distribution between HC and CRC patients. (C) TCR repertoire Shannon diversity comparison. (D) TCR repertoire clonality comparison. (E) TCRβ amino acid sequence length distribution comparison. (F) CRC-specific sequence diagram. The top 30 CRC-specific CDR3 sequences are shown in the histogram. (G) Vβ gene usage comparison. (H) Jβ gene usage comparison. All comparisons were performed using the FDR-corrected Mann–Whitney U test, * p < 0.05, ** p < 0.01, *** p < 0.001.

Figure 3. Machine learning models for diagnosing CRC. (A) The ROC curve for the internal test set across 7 machine learning (ML) and 3 deep learning (DL) methods. (B) Comparative analysis of the internal test set accuracy, AUC, recall, and F1-score across 7 ML and 3 DL methods. (C) The top 50 feature importances determined by permutation importance for the Transformer model. (D) The ROC curve of the Transformer model using top 50 features for the internal test set.

Figure 4. Validation of Transformer models on external validation datasets. ROC curves of CRC diagnostic model using all CDR3 features (A) and top 50 features (B) on external validation datasets. (C) Top 50 TRBV-TRBJ gene combination usage comparison. FDR-corrected Mann–Whitney U test, * p < 0.05, ** p < 0.01, *** p < 0.001.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
