Open Access Article
Cells 2019, 8(2), 122

A High Efficient Biological Language Model for Predicting Protein–Protein Interactions

Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
University of Chinese Academy of Sciences, Beijing 100049, China
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Received: 27 December 2018 / Revised: 26 January 2019 / Accepted: 2 February 2019 / Published: 3 February 2019
(This article belongs to the Special Issue Bioinformatics and Computational Biology 2019)


Many life activities and key functions in organisms are maintained by different types of protein–protein interactions (PPIs). To accelerate the discovery of PPIs in different species, many computational methods have been developed. Although these methods continue to evolve, efficient methods for predicting PPIs from protein sequence information alone have remained elusive for many years because of limitations in both methodology and technology. Given the similarity between biological sequences and natural languages, biological language processing technology may provide a new theoretical perspective and a feasible method for studying biological sequences. In this paper, a pure biological language processing model is proposed for predicting protein–protein interactions using only protein sequences. The model is built on a feature representation method for biological sequences called bio-to-vector (Bio2Vec) and a convolutional neural network (CNN). Bio2Vec obtains protein sequence features by using a “bio-word” segmentation system and a word representation model that learns a distributed representation for each “bio-word”. Bio2Vec supplies a framework that allows researchers to consider the context information and implicit semantic information of a biological sequence. A remarkable improvement in PPI prediction performance was observed with the proposed model compared with state-of-the-art methods. This approach marks the start of “bio language processing technology,” which could cause a technological revolution and could be applied to improve the quality of predictions in other problems.
Keywords: protein–protein interactions; bio-language processing; sentencepiece; convolution neural network; unigram language model
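
The “bio-word” segmentation step can be illustrated with a small sketch. The keywords name a unigram language model, and one common way to apply such a model is to choose the split of a sequence whose pieces have the highest total log-probability, found by dynamic programming (Viterbi). The code below is only an illustration under that assumption, not the authors' implementation: the function name `viterbi_segment`, the toy vocabulary, and its probabilities are all hypothetical, and a real pipeline would learn the vocabulary from a large set of protein sequences before training word embeddings and a CNN on the resulting pieces.

```python
import math


def viterbi_segment(sequence, log_probs, max_len=None):
    """Split `sequence` into vocabulary pieces with the highest total log-probability."""
    if max_len is None:
        max_len = max(len(piece) for piece in log_probs)
    n = len(sequence)
    best = [-math.inf] * (n + 1)  # best[i]: best score for sequence[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = sequence[start:end]
            if piece not in log_probs:
                continue
            candidate = best[start] + log_probs[piece]
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("sequence cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(sequence[start:end])
        end = start
    return pieces[::-1], best[n]


# Hypothetical toy vocabulary: made-up probabilities, not values from the paper.
TOY_VOCAB = {
    "M": 0.05, "K": 0.05, "T": 0.05, "A": 0.05, "Y": 0.05, "I": 0.05,
    "MK": 0.10, "TA": 0.10, "YI": 0.10, "MKT": 0.20, "AYI": 0.20,
}
TOY_LOG_PROBS = {piece: math.log(p) for piece, p in TOY_VOCAB.items()}

bio_words, score = viterbi_segment("MKTAYI", TOY_LOG_PROBS)
print(bio_words, round(score, 3))
```

With this toy vocabulary, the sequence `MKTAYI` is split into the two pieces `MKT` and `AYI`, because that split scores higher than any split into shorter pieces. Each piece would then play the role of a word when learning distributed representations.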

Figure 1

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

MDPI and ACS Style

Wang, Y.; You, Z.-H.; Yang, S.; Li, X.; Jiang, T.-H.; Zhou, X. A High Efficient Biological Language Model for Predicting Protein–Protein Interactions. Cells 2019, 8, 122.

Cells, EISSN 2073-4409. Published by MDPI AG, Basel, Switzerland.