1. Introduction
Humans spend over 90 percent of their time indoors, whether at home or at work [1]. Indoor air can quickly become contaminated with pathogens carried in through an open door or by an infected host. Contaminated air substantially increases the chance of infection via airborne transmission. Prevalent diseases, such as tuberculosis, asthma, and COVID-19, pose serious health risks and are sometimes fatal. The World Health Organization (WHO) has reported that an average of 3.8 million people die every year because of contaminated indoor air [2].
Bacteria and fungi are the most common airborne pathogens. These organisms occupy specific ecological niches and can adapt to their environment. Living organisms produce proteins, which are chains of amino acids linked by peptide bonds and act as building blocks in many organisms. The roles of proteins include acting as hormones that affect various parts of the organism for purposes such as reproduction and heart rate control. Proteins also participate in chemical reactions within an organism as enzymes, the catalysts responsible for speeding up metabolism [3]. Although proteins perform many valuable functions, some harm other organisms. These proteins are known as toxic proteins, or toxins; they act as virulence factors and cause diseases [4,5].
Due to globalization and advances in transportation, the number of people traveling from one place to another, often between countries, is steadily increasing, which facilitates the spread of highly contagious and deadly diseases. Pathogens also change unpredictably through spontaneous mutations, and some variants become more contagious or deadly [6,7]. Therefore, it is crucial to understand mutations and predict the toxicity of a particular variant to create countermeasures. Traditional in vivo and in vitro methods are time-consuming and expensive. In silico methods, in contrast, provide faster results. Although they may not be sufficiently accurate on their own, they can help guide researchers in identifying toxic sequences.
Studies in bioinformatics have shown that deep-learning techniques for analyzing genome and amino acid sequence data are helpful in many subtasks. For instance, finding DNA-protein-interacting regions using reinforcement learning [8] and predicting the 3D structure of a protein [9] have significantly decreased the workload of many microbial studies. In addition, various methods have been employed to predict protein toxicity. Traditional machine learning methods, such as support vector machines (SVMs) and random forests (RFs), were used in ToxinPred [10]. Clantox uses boosted stump classifiers to classify toxic and nontoxic animal proteins. Deep learning techniques have also been used to predict toxic protein sequences [11]. For instance, TOXIFY embeds protein sequences using the Atchley factor matrix and passes them through a set of gated recurrent units (GRUs) [12]. ToxDL combines protein domain knowledge with features derived from a convolutional neural network (CNN) module for prediction [13]. ToxIBTL embeds protein sequences using FEGS and the BLOSUM62 matrix, merges both feature sets, and passes them through an information bottleneck layer [14].
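The matrix-based embeddings mentioned above share one mechanism: each residue is looked up in a per-residue table, and the resulting rows are stacked and padded to a fixed length. The following is a minimal sketch of that mechanism only. The function names are hypothetical, and a one-hot table stands in for the real Atchley factor or BLOSUM62 values, which would replace it in an actual pipeline.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot_table() -> Dict[str, List[float]]:
    """Placeholder lookup table (assumption): one one-hot row per residue.

    A real pipeline would load Atchley factors (5 values per residue)
    or BLOSUM62 rows here instead.
    """
    size = len(AMINO_ACIDS)
    return {
        aa: [1.0 if i == j else 0.0 for j in range(size)]
        for i, aa in enumerate(AMINO_ACIDS)
    }


def embed_sequence(
    seq: str, table: Dict[str, List[float]], max_len: int
) -> List[List[float]]:
    """Map each residue to its table row, then truncate or zero-pad to max_len."""
    dim = len(next(iter(table.values())))
    zero = [0.0] * dim
    # Residues missing from the table (e.g. X) become all-zero rows.
    rows = [list(table.get(aa, zero)) for aa in seq.upper()[:max_len]]
    rows += [list(zero) for _ in range(max_len - len(rows))]
    return rows


if __name__ == "__main__":
    matrix = embed_sequence("ACX", one_hot_table(), max_len=5)
    print(len(matrix), len(matrix[0]))
```

The fixed-size matrix is what a recurrent or convolutional module would then consume; how each cited method pads, truncates, or handles unknown residues is not stated here and may differ from this sketch.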
Language models developed for natural language processing have yielded promising results over the past few years. The transformer model proposed by Vaswani et al. outperformed the previous state-of-the-art models, achieving higher bilingual evaluation understudy (BLEU) scores with less computation [15]. Newer transformer-based models, such as bidirectional encoder representations from transformers (BERT) [16], have shown that pretrained language models improve performance on many natural language tasks, and such models have also been applied to other problems, such as image classification and semantic segmentation [17]. ProtBert is one of many domain-specific BERT models. Proposed by Elnaggar et al., it has more layers than the original BERT implementation and is pretrained on protein sequences from UniRef and BFD [18].
In this study, we propose the use of a fine-tuned ProtBert model to predict bacterial proteins that may act as virulence factors. We first tested the model on existing toxic protein datasets to determine whether it could outperform previous methods for toxic protein classification. We then trained the model on a new dataset, in which we labeled virulence factors as toxic protein sequences, to determine whether toxic protein prediction would improve compared with training on toxic protein sequences alone. Finally, we applied the model to randomly selected protein sequences from four bacteria commonly found indoors.
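Before a sequence can be passed to ProtBert for fine-tuning or prediction, it has to match the input format of the pretrained model. The publicly released ProtBert checkpoint expects uppercase residues separated by spaces, with the rare residues U, Z, O, and B mapped to X. The sketch below shows that preparation step; the function name is hypothetical, and whether the preprocessing in this study matches it exactly is an assumption.

```python
import re


def prepare_for_protbert(sequence: str) -> str:
    """Format a raw amino acid string the way ProtBert's tokenizer expects.

    Assumption: this follows the convention of the public ProtBert release,
    not necessarily the exact preprocessing used in this study.
    """
    # Drop whitespace and line breaks (e.g. from FASTA wrapping), then uppercase.
    cleaned = re.sub(r"\s+", "", sequence).upper()
    # Map rare or ambiguous residues to the unknown token X.
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    # Separate residues with spaces so each one becomes its own token.
    return " ".join(cleaned)


if __name__ == "__main__":
    print(prepare_for_protbert("maUb k"))
```

The formatted string would then be tokenized and fed to the model with a binary classification head (toxic versus nontoxic).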
4. Conclusions
In this study, we proposed the use of ProtBert for the prediction of toxic bacterial proteins. We tested our model on two public datasets and showed that it yields results similar to those of previous methods for animal toxic protein prediction and toxic bacterial protein prediction. We also trained the model using bacterial virulence factors to investigate whether its performance would improve when trained on much broader data. The results showed that our model could correctly classify toxic bacterial protein sequences. The in vitro experiments on unlabeled protein sequences revealed the possibility of finding new toxic protein sequences and indicated that the in silico method can capture possible toxic protein sequences.
It is noteworthy, however, that even though we could identify possible toxic proteins that may act as virulence factors, we can only presume that these proteins are responsible for hazardous reactions in the in vitro experiments. Hence, we intend to further investigate the link between the identified protein sequences and virulence data through more thorough in vitro experiments.
To strengthen the performance of in silico protein toxicity prediction, we hope to add other features to the training of the model, such as evolutionary information and protein chemical composition, which are known to be associated with harmful effects.
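As one illustration of a chemical composition feature, the fraction of each of the 20 standard amino acids in a sequence gives a fixed-length vector that could be concatenated with a learned embedding. This is a minimal sketch under that assumption; the function name is hypothetical, and the specific features planned for future work are not stated here.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq: str) -> Dict[str, float]:
    """Return the fraction of each standard residue in seq.

    Non-standard characters (e.g. X) are ignored when counting.
    """
    counts = Counter(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty or fully non-standard input: return an all-zero vector.
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    composition = amino_acid_composition("AACX")
    print(composition["A"], composition["C"])
```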