1. Introduction
Over centuries, the living and working environments of human beings have gradually shifted from outdoors to indoors. At present, the majority of people spend approximately 20 h per day indoors. Although staying indoors offers protection from rain, heat, and other environmental factors, doing so for prolonged periods may pose health hazards. Indoor air pollution causes various cardiovascular and respiratory diseases, which accounted for 3.2 million deaths in 2020 [1,2,3,4]. Indoor air pollution is caused by combustion devices, new furniture, and tobacco smoke, which release chemical pollutants such as carbon monoxide and sulfur dioxide. There are also biological pollutants, which include allergens, such as animal fur and house dust mites, and microbes, such as viruses, bacteria, and fungi.
Type A influenza is considered seasonal in a majority of the Korean population. The growing number of patients each year has increased the awareness regarding the prevention of infectious diseases. For viral diseases, the 2015 MERS [5,6,7] outbreak followed by the 2019 COVID-19 pandemic [8,9,10] has prompted research on viral outbreak prevention. In contrast, fungal infections are often neglected, owing to the few reported cases. However, this does not lower the threat posed by fungal infections to human health. Aspergillosis, caused by the common household mold Aspergillus, may not be an imminent threat to healthy individuals. However, for individuals with a weakened immune system, allergic reactions or lung damage may occur [11,12]. More fatal diseases include Pneumocystis pneumonia, which is caused by Pneumocystis jirovecii [13,14]. Reports from the Centers for Disease Control and Prevention (CDC) highlight an increase in reported fungal infections in the US, cautioning of a possible fungal disease outbreak [15].
Protein sequencing plays a crucial role in understanding the structure and biological function of proteins. In particular, identifying the functions of new microbes is crucial, as the slightest mutation may cause microbes to act in a hazardous manner. However, traditional protein sequence analysis methods, such as Edman degradation and X-ray crystallography, require a significant amount of time and resources. This makes it difficult to decide which microbes are worth analyzing, because it is inefficient to invest in research on microbes that are not well known or are perceived to be harmless to other organisms. In silico modeling can be used to address this issue, as state-of-the-art computer simulations can provide a rough estimate of the function of a given protein sequence. These results may not provide accurate insight into the protein’s function but can guide in vitro and in vivo researchers to devote their resources to other, more likely candidate proteins.
In recent years, deep learning technology has resulted in many innovations in computer vision and natural language processing. The transformer model introduced in 2017 [16] uses an attention mechanism for machine translation. Many deep learning models that adapt its architecture have been proposed. BERT, RoBERTa, and DistilBERT focus on creating contextualized word embeddings through multiple encoder attention blocks from the original transformer model [17,18,19]. In contrast, the generative pre-trained transformer (GPT) relies on the decoder of the transformer and is typically used for generation tasks such as question answering [20,21].
Because large language models are versatile, any data in text form that carry contextual information can be used for pre-training. Such models include ChemBERTa [22], MolBERT [23], and SolvBERT [24] from the field of molecular representation learning, which use simplified molecular-input line-entry system (SMILES) data. Protein sequences can also be used for training because they share many similarities with human text, such as repetitive regions and contextual data [25]. Using publicly available protein sequence data, several models have been proposed, such as ProtBERT, ProtT5, and Ankh [26,27].
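As an illustration of how a protein sequence can be treated like a sentence, the following minimal sketch splits a raw amino-acid string into single-residue "words". The function names are hypothetical, and the mapping of the rare residues U, Z, O, and B to X is an assumed convention used by some protein language models; the text does not state how the sequences in this study were preprocessed.

```python
import re


def to_residue_tokens(sequence):
    """Split a raw amino-acid string into single-residue tokens."""
    # Remove whitespace and normalize to upper-case one-letter codes.
    cleaned = re.sub(r"\s+", "", sequence).upper()
    # Assumption: rare or ambiguous residues are replaced by the unknown residue X.
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    return list(cleaned)


def to_model_input(sequence):
    """Join residue tokens with spaces so each residue reads as one word."""
    return " ".join(to_residue_tokens(sequence))


# Hypothetical short sequence, for illustration only.
print(to_model_input("mkt ayU"))
```

Each residue then plays the role of a word, so the same encoder that contextualizes words in a sentence can contextualize residues in a sequence.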
Such large language models are pre-trained on massive amounts of data, and for many deep learning applications, fine-tuning the pre-trained weights on a task-specific dataset is sufficient to yield satisfactory results. A major problem in fine-tuning is that, in the absence of layer freezing, all the parameters must be trained. This is not only time-consuming but also raises the hardware barrier for anyone wishing to fine-tune the model for their application. Parameter-efficient fine-tuning (PEFT) was introduced as a cost-effective alternative. In PEFT, all the parameters of the pre-trained model are frozen, and a small number of new trainable parameters are added and trained. Results similar to those of a fully fine-tuned model can thus be obtained with drastically fewer trainable parameters. This was demonstrated in [28], where adapters inserted additional trainable layers into the transformer block. Low-rank adaptation (LoRA) further reduces the number of trainable parameters by representing the weight update as a product of low-rank decomposition matrices and optimizing only those matrices [29].
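The low-rank idea can be sketched for a single linear layer as follows. This is a generic, dependency-free illustration and not the implementation used in this study: the class name `LoRALinear`, the helper `matvec`, the rank `r`, the scaling factor `alpha`, and the initialization values are all assumptions made for the example.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # pre-trained weight, kept frozen during fine-tuning
        # A (r x d_in) starts with small random values; B (d_out x r) starts at zero,
        # so the layer initially behaves exactly like the pre-trained one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        update = matvec(self.B, matvec(self.A, x))  # low-rank path, B @ (A @ x)
        return [b + self.scale * u for b, u in zip(base, update)]


# Hypothetical 3 x 2 weight matrix, for illustration only.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
layer = LoRALinear(W, r=1)
print(layer.forward([1.0, -1.0]))
```

Only `A` and `B` would receive gradient updates during fine-tuning; `W` is never modified, which is why the pre-trained representation is retained.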
In this study, we attempted to verify the reliability of deep learning models as an appropriate in silico method for predicting the toxicity of fungal species by comparing the prediction results to those of the in vitro experiments. To train the in silico model, we focused on two major tasks. Initially, we trained the in silico model on fungal protein data and then improved its time efficiency using PEFT. For the in vitro experiments, we evaluated the cytotoxicity of fungal species with a 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) assay.
4. Discussion
We hypothesized that applying LoRA to ProtBERT would improve the model efficiency at the cost of decreasing the model performance. However, the in silico training results presented in Section 3.1 showed otherwise, as all the performance measures increased compared with those of normal fine-tuning. We presume that this is because of the difference in the number of trainable parameters and how LoRA works. Normally fine-tuned ProtBERT has far more parameters to train than ProtBERT with LoRA, which has approximately 400 times fewer trainable parameters. With fewer parameters to train, the likelihood of catastrophic forgetting is reduced. Additionally, LoRA trains only the newly added layers and does not update the original pre-trained ProtBERT weights. The pre-trained weights, which are already capable of extracting useful contextual data, are therefore preserved.
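To illustrate how a reduction in trainable parameters arises, consider a single weight matrix with input dimension d_in and output dimension d_out that is adapted with rank r. These symbols are introduced here for illustration only; the text does not state which matrices were adapted or which rank was used.

```latex
\begin{aligned}
N_{\text{full}} &= d_{\text{out}}\, d_{\text{in}}, \\
N_{\text{LoRA}} &= r\,(d_{\text{in}} + d_{\text{out}}), \\
\frac{N_{\text{full}}}{N_{\text{LoRA}}}
  &= \frac{d_{\text{out}}\, d_{\text{in}}}{r\,(d_{\text{in}} + d_{\text{out}})}
  \;\overset{d_{\text{in}} = d_{\text{out}} = d}{=}\; \frac{d}{2r}.
\end{aligned}
```

For a hypothetical square matrix with d = 1024 and r = 8, the per-matrix ratio is 64. The model-wide ratio can be considerably larger, because low-rank matrices are typically attached to only some of the weight matrices while all remaining parameters stay frozen, so this per-matrix figure is not expected to match the reduction reported above.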
The indoor airborne fungal species predicted by our algorithm to harbor potentially toxic and/or virulent proteins include Alternaria alternata, Aspergillus niger, Chaetomium globosum, Fusarium equiseti, Fusarium proliferatum, Neurospora tetrasperma, Penicillium brasilianum, Penicillium chrysogenum, Penicillium oxalicum, Phanerochaete sordida, Schizophyllum commune, and Trichoderma harzianum, and they exhibited toxic activity in the two human cell lines used in this study. Known as plant pathogens, they are subjects of ongoing research in various areas, including basic research and biotechnology related to enzymes and genomics. While the fungal spores that they produce, like those of most fungi, can induce allergic reactions in humans, their pathogenic potential in humans remains to be further explored. Our in vitro cell viability results closely align with the in silico predictions, suggesting a potential methodology for evaluating the cytotoxicity of fungi present in indoor air that combines in silico prediction with experimental assays. Despite the limited data on fungal protein sequences available for training transformer models, this pilot study successfully developed an in silico prediction module that runs in parallel with in vitro cytotoxicity evaluation. These efforts contribute to the development of technology for swiftly characterizing unidentified airborne indoor fungi, which might threaten or harm human health.
5. Conclusions
In this study, we improved the in silico model performance and assessed the reliability of using ProtBERT for fungal toxicity prediction. To improve model performance, we applied LoRA to the ProtBERT model. The in silico experimental results showed that ProtBERT with LoRA outperformed the normal fine-tuning method. Using the trained in silico model, we compared the toxicity predictions for fungal species with our in vitro experimental results. The in silico toxicity predictions suggested that some protein sequences of unknown function may contribute to fungal toxicity. The in vitro experiments revealed A. alternata, F. proliferatum, and P. brasilianum as possibly toxic fungal species.
By comparing the possibly toxic proteins of these fungal species with those in our in silico results, we presume that certain unknown proteins predicted to be either toxic or virulent may be the cause. In the future, we plan to confirm this hypothesis by performing additional in vitro and in vivo experiments.