A Knowledge-Based Bidirectional Encoder Representation from Transformers to Predict a Paratope Position from a B Cell Receptor’s Amino Acid Sequence Alone
Abstract
1. Introduction
2. Methods
2.1. Data
2.2. Architecture
2.3. Analysis
- Model 1 (No Oversampling, Common Knowledge Context): This baseline model used the original data without oversampling. The 20,679 original cases were split into training and test sets at a 50:50 ratio (10,356 vs. 10,323 cases), with label 0:1 counts of 9381:975 in the training set and 9343:980 in the test set. The knowledge context was common to all amino acids within each BCR chain, as shown in Figure 1.
- Model 2 (No Oversampling, Different Knowledge Context): Like Model 1, this model used the original data without oversampling, with the same 50:50 split (10,356 vs. 10,323 cases) and the same label 0:1 counts of 9381:975 (training) and 9343:980 (test). However, the knowledge context differed across amino acids within each BCR chain, as presented in Figure 1.
- Model 3 (Weak Oversampling, Different Knowledge Context): The original 18,724:1955 cases for labels 0:1 were oversampled to 18,724:3165; that is, 1210 synthetic positive cases were created with the Synthetic Minority Oversampling Technique (SMOTE). The resulting 21,889 cases were split into training and test sets at a 50:50 ratio (10,961 vs. 10,928 cases), with label 0:1 counts of 9381:1580 (training) and 9343:1585 (test). The knowledge context differed across amino acids within each BCR chain.
- Model 4 (Strong Oversampling, Different Knowledge Context): The original 18,724:1955 cases for labels 0:1 were oversampled to 18,724:14,055; that is, 12,100 synthetic positive cases were created with SMOTE. The resulting 32,779 cases were split into training and test sets at a 50:50 ratio (16,406 vs. 16,373 cases), with label 0:1 counts of 9381:7025 (training) and 9343:7030 (test). The knowledge context differed across amino acids within each BCR chain. A code sketch of the oversampling and splitting procedure appears after this list.
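The sketch below illustrates how this oversampling and splitting procedure could be reproduced. It is not the authors' exact pipeline: the feature matrix X, the imbalanced-learn implementation of SMOTE, the stratified split, and the random seed are assumptions introduced for illustration only.

```python
# A minimal sketch (assumptions, not the authors' exact pipeline) of SMOTE
# oversampling of the positive class followed by a 50:50 train/test split.
# X is assumed to be a numeric feature matrix and y a 0/1 label vector.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

def make_split(X, y, n_positive_target=None, seed=0):
    """Optionally oversample label 1 to n_positive_target cases, then split 50:50."""
    if n_positive_target is not None:
        # sampling_strategy maps each class label to its desired count after resampling
        X, y = SMOTE(sampling_strategy={1: n_positive_target},
                     random_state=seed).fit_resample(X, y)
    return train_test_split(X, y, test_size=0.5, stratify=y, random_state=seed)

# Models 1 and 2: original data (18,724 negative vs. 1955 positive cases)
# X_tr, X_te, y_tr, y_te = make_split(X, y)
# Model 3: weak oversampling to 3165 positive cases
# X_tr, X_te, y_tr, y_te = make_split(X, y, n_positive_target=3165)
# Model 4: strong oversampling to 14,055 positive cases
# X_tr, X_te, y_tr, y_te = make_split(X, y, n_positive_target=14055)
```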
3. Results
3.1. Background Information
3.2. Model Performance
4. Discussion
4.1. Summary
4.2. Contributions
4.3. Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References



| Amino Acid | RNA Codons | DNA Codons |
|---|---|---|
| Ala A | GCU, GCC, GCA, GCG | GCT, GCC, GCA, GCG |
| Arg R | CGU, CGC, CGA, CGG; AGA, AGG | CGT, CGC, CGA, CGG; AGA, AGG |
| Asn N | AAU, AAC | AAT, AAC |
| Asp D | GAU, GAC | GAT, GAC |
| Asn/Asp B | AAU, AAC; GAU, GAC | AAT, AAC; GAT, GAC |
| Cys C | UGU, UGC | TGT, TGC |
| Gln Q | CAA, CAG | CAA, CAG |
| Glu E | GAA, GAG | GAA, GAG |
| Gln/Glu Z | CAA, CAG; GAA, GAG | CAA, CAG; GAA, GAG |
| Gly G | GGU, GGC, GGA, GGG | GGT, GGC, GGA, GGG |
| His H | CAU, CAC | CAT, CAC |
| Ile I | AUU, AUC, AUA | ATT, ATC, ATA |
| Leu L | CUU, CUC, CUA, CUG; UUA, UUG | CTT, CTC, CTA, CTG; TTA, TTG |
| Lys K | AAA, AAG | AAA, AAG |
| Met M | AUG | ATG |
| Phe F | UUU, UUC | TTT, TTC |
| Pro P | CCU, CCC, CCA, CCG | CCT, CCC, CCA, CCG |
| Ser S | UCU, UCC, UCA, UCG; AGU, AGC | TCT, TCC, TCA, TCG; AGT, AGC |
| Thr T | ACU, ACC, ACA, ACG | ACT, ACC, ACA, ACG |
| Trp W | UGG | TGG |
| Tyr Y | UAU, UAC | TAT, TAC |
| Val V | GUU, GUC, GUA, GUG | GTT, GTC, GTA, GTG |
| Start | AUG, CUG, GUG, UUG | ATG, CTG, GTG, TTG |
| Stop | UAA, UGA, UAG | TAA, TGA, TAG |
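The codon-to-amino-acid mapping in the table above can also be reproduced programmatically. The snippet below is an illustrative sketch only, assuming Biopython is available; the DNA sequence is an arbitrary example and is not drawn from the study data.

```python
# Illustrative sketch (not part of the study's pipeline): translating DNA codons
# into one-letter amino acid codes with Biopython's standard genetic code table.
from Bio.Seq import Seq

dna = Seq("ATGGCTCGTAATTGA")   # Met-Ala-Arg-Asn-Stop, an arbitrary example
protein = dna.translate()      # standard table by default; '*' marks the stop codon
print(protein)                 # -> MARN*
```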
| Data | Metric (%) | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|---|
| Training | Accuracy | 90.1 | 60.2 | 60.2 | 62.9 |
| Training | Sensitivity | 2.0 | 32.8 | 48.0 | 90.1 |
| Training | Specificity | 98.8 | 76.5 | 76.5 | 76.5 |
| Training | F1 | 50.4 | 54.7 | 62.3 | 83.3 |
| Test | Accuracy | 90.3 | 73.4 | 74.1 | 83.2 |
| Test | Sensitivity | 0.0 | 29.1 | 45.1 | 90.1 |
| Test | Specificity | 100.0 | 78.1 | 78.1 | 78.1 |
| Test | F1 | 50.0 | 53.6 | 61.6 | 84.1 |
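For reference, the metrics in the table are defined from the confusion matrix as sketched below. The labels and predictions are placeholders rather than outputs of the study, and the F1 computed here is the positive-class score scaled to percent; the paper's exact averaging convention may differ.

```python
# Hedged sketch of the metric definitions behind the performance table
# (placeholder data, not the study's predictions).
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])   # hypothetical ground-truth labels
y_pred = np.array([0, 0, 1, 0, 0, 1, 1, 0])   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = 100 * (tp + tn) / (tp + tn + fp + fn)
sensitivity = 100 * tp / (tp + fn)             # true positive rate (recall)
specificity = 100 * tn / (tn + fp)             # true negative rate
f1          = 100 * f1_score(y_true, y_pred)   # harmonic mean of precision and recall
print(accuracy, sensitivity, specificity, f1)
```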