1. Introduction
Contemporary peptide science encompasses both biological and chemical approaches. Medical sciences, pharmacology, biotechnology, and, last but not least, food and nutrition sciences need both biology and chemistry. Peptides are a focus of interest in all of these areas.
Tools for in silico peptide research, such as databases and programs, draw on both approaches, classified as bioinformatics and cheminformatics, respectively, although most specialized databases and programs dedicated to peptides may be classified as bioinformatic tools [1]. The two approaches use different languages to describe the structures of biomolecules (e.g., peptides) [2,3]: bioinformatics operates on amino acid sequences, whereas cheminformatics operates on universal chemical codes. Communication between these two areas requires translation from the biological into the chemical language [2,3]. The dataset concerning biological activities of taste-affecting peptides, published in our review [4], may serve as an example of the benefits of merging the biological and chemical approaches. It could not have been completed without screening databases using peptide structures, annotated in the SMILES code [5,6], as queries. Another example of in silico research utilizing the chemical approach has recently been published by Ortiz-Martinez and others [7]. They used the SwissTargetPrediction program [8,9], provided by the Swiss Institute of Bioinformatics, Lausanne, Switzerland, to predict interactions between small peptides from maize and human proteins. Chemical modifications of peptides, aimed at altering their biological activity, may currently be considered a “hot topic” [10,11,12,13,14]. Processing of peptide sequences that include non-protein or modified amino acids is possible, e.g., using the PepstrMod program [15,16] (provider: Institute of Microbial Technology, Chandigarh, India). This program handles hundreds of non-protein or modified amino acid residues. The possible space of non-natural or modified amino acids and other possible constituents of peptides, however, contains billions of possible molecules or molecule fragments [17]. SMILES and other chemical codes and formats enable the description of any artificially inserted substituents for in silico study of the properties of modified peptides. Another approach is the search for peptidomimetics, which are potentially useful as drugs [18], on the basis of known peptide structures.
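The translation from the biological into the chemical language mentioned above can be illustrated with a minimal Python sketch. It is not the procedure used by any of the cited programs. It assumes a linear peptide with free termini, built only from a small, hand-written table of L-amino acids with a single chirality center (plus glycine); the names `RESIDUE_SMILES` and `sequence_to_smiles` are hypothetical and introduced here for illustration only.

```python
# Illustrative sketch only: a hand-written fragment table, not a complete residue library.
# Each fragment is written N-terminus first as N[C@@H](side chain)C(=O), which
# corresponds to the L configuration; glycine has no chirality center.
RESIDUE_SMILES = {
    "G": "NCC(=O)",
    "A": "N[C@@H](C)C(=O)",
    "V": "N[C@@H](C(C)C)C(=O)",
    "L": "N[C@@H](CC(C)C)C(=O)",
    "S": "N[C@@H](CO)C(=O)",
    "F": "N[C@@H](Cc1ccccc1)C(=O)",
    "D": "N[C@@H](CC(=O)O)C(=O)",
    "K": "N[C@@H](CCCCN)C(=O)",
}


def sequence_to_smiles(sequence):
    """Translate a one-letter amino acid sequence into a SMILES string.

    Residue fragments are concatenated in order, which forms the peptide
    bonds, and a final "O" completes the C-terminal carboxyl group.
    """
    if not sequence:
        raise ValueError("empty sequence")
    fragments = []
    for residue in sequence.upper():
        if residue not in RESIDUE_SMILES:
            raise ValueError(f"no SMILES fragment for residue {residue!r}")
        fragments.append(RESIDUE_SMILES[residue])
    return "".join(fragments) + "O"


if __name__ == "__main__":
    print(sequence_to_smiles("GA"))  # glycyl-L-alanine
```

Residues with a second chirality center (isoleucine, threonine), proline, and modified residues are deliberately left out of the table, because their fragments need more care. Any string produced this way should still be inspected, for instance in a molecular editor or against a database record.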
Many programs utilizing the SMILES code are now available, including, e.g., the BioTriangle program provided by the Central South University, Changsha, China [19,20], which serves to calculate, e.g., physicochemical and topological parameters of small molecules. Some programs utilizing SMILES are available via the website of the Swiss Institute of Bioinformatics, Lausanne, Switzerland [21]. This website offers access to, e.g., the SwissADME program [22,23], which allows prediction of properties that affect the applicability of a substance as a drug. Another example of a program utilizing the chemical code is WebMolCS [24,25], alongside other programs developed at the University of Bern, Bern, Switzerland [26] within the Chemical Space Project [17]. Apart from the above freely accessible programs, there are also commercial tools, such as JChem or MadFast [27], both provided by ChemAxon, Budapest, Hungary, which utilize chemical codes for database screening or calculations. There are also specialized peptide databases using the SMILES code, such as Brainpeps [28,29] or Quorumpeps [30,31], provided by the University of Ghent, Belgium; and AHTPDB [32,33], CancerPPD [34,35], Hemolytik [36,37], ParaPep [38,39] or PepLife [40,41], provided by the Institute of Microbial Technology, Chandigarh, India. The above resources are integrated via the SATPdb metabase [42,43]. Another example is the BIOPEP database of sensory peptides and amino acids [44,45], provided by the University of Warmia and Mazury in Olsztyn, Poland. The number of programs and databases utilizing chemical codes is steadily increasing. More links to such tools are available via metabases and metaservers [46,47,48,49,50].
The bioinformatics approach to peptides involves, e.g., modeling structures and predicting interactions with biomacromolecules on the basis of amino acid sequences [1,51]. Structure modeling based on amino acid sequences may be performed using programs such as PepstrMod [15,16], Pep-Fold [52,53] (provider: University of Paris, Diderot, Paris, France), or (PS)2.v3 (provider: National Chiao Tung University, Hsinchu, Taiwan) [54,55]. For instance, the Quantitative Structure-Activity Relationship (QSAR) approach involves a set of parameters describing the structure and physicochemical properties of particular amino acid residues [56,57]. The sequence-based approach is expanded using pseudo-amino acid composition [58,59,60]. The application of chemical information for the annotation of peptides and for processing their structures may enlarge the array of tools available for peptide research in silico.
Cheminformatics tools cannot, however, be treated and used uncritically as “black boxes”. Many published datasets contain errors. Users or curators of databases and programs using chemical information should be prepared to recognize and correct possible errors [61,62,63,64]. Validation of representations, together with identification and correction of mislabeled compounds, is recommended as one of the crucial steps of compound dataset preparation and curation [2,63]. The preparation of peptide datasets, involving translation from amino acid sequences into chemical codes, is no exception. Peptide data annotated using chemical information codes require careful inspection before use.
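As a first, purely syntactic step of such an inspection, a SMILES string can be screened for unbalanced brackets and unclosed ring labels before it is passed to any other program. The sketch below is a simplified illustration under stated assumptions: the function name `smiles_syntax_problems` is hypothetical, only a small part of the SMILES grammar is checked, and valence or stereochemistry errors are not detected at all, so it does not replace validation with a full cheminformatics toolkit.

```python
def smiles_syntax_problems(smiles: str) -> list[str]:
    """Return a list of simple syntax problems found in a SMILES string.

    Checked: balanced parentheses, balanced square brackets, and ring-closure
    labels (single digits or %nn) that are opened but never closed.
    Digits inside square brackets (isotopes, hydrogen counts, charges)
    are ignored because they are not ring labels.
    """
    problems = []
    depth = 0            # current nesting depth of branches "(...)"
    in_bracket = False   # True while inside a bracket atom "[...]"
    open_rings = set()   # ring labels seen an odd number of times so far
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            if ch == "]":
                in_bracket = False
            elif ch == "[":
                problems.append("nested '['")
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            problems.append("unmatched ']'")
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                problems.append("unmatched ')'")
            else:
                depth -= 1
        elif ch == "%":
            label = smiles[i + 1:i + 3]
            if len(label) == 2 and label.isdigit():
                open_rings ^= {label}   # toggle: open on first use, close on second
                i += 2
            else:
                problems.append("malformed '%' ring label")
        elif ch.isdigit():
            open_rings ^= {ch}
        i += 1
    if in_bracket:
        problems.append("unclosed '['")
    if depth > 0:
        problems.append("unclosed '('")
    if open_rings:
        problems.append(f"unclosed ring label(s): {sorted(open_rings)}")
    return problems


if __name__ == "__main__":
    for candidate in ("N[C@@H](Cc1ccccc1)C(=O)O", "N[C@@H](Cc1ccccc)C(=O)O"):
        print(candidate, smiles_syntax_problems(candidate))
```

An empty list means only that none of these simple problems was found; the string may still describe the wrong molecule.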
The aim of this review article is to present practical solutions for translating peptide annotations from the biological into the chemical language and for correcting possible errors using contemporary software (with special attention to non-commercial programs), without extensive historical background. The proposed recommendations are based on our experience with compiling and curating the BIOPEP database of sensory peptides and amino acids [44,45] and the MetaComBio website (University of Warmia and Mazury in Olsztyn, Poland) [48,49].
4. Brief Recommendations for Training
The importance of training in cheminformatics has recently been emphasized by Tetko and co-workers [128]. They pointed out that cheminformatics needs experience in two areas: chemistry and informatics. In the case of peptide science, the users of existing databases and/or software (such as the authors of this article, who are not informaticians) need skills in two areas: chemistry, and biochemistry with molecular biology. Development of new software requires, apart from skills in informatics, sufficient experience in both of these areas. The recommendations below concern information from the area of chemistry, which is sufficient for the approach described in this publication. Information from this area may also be important for biologists using amino acid sequences and abbreviation-based codes to describe other biomolecules (sugars, nucleotides, and acylglycerols).
The checklist of skills includes training in the use of a molecular editor, covering molecule drawing, input and output options, as well as its strong and weak points. The second step involves knowledge of all details of peptide structures, including protein, non-protein and non-natural amino acids as well as modifications (e.g., sugar or lipid moieties). Special attention should be paid to the configuration of substituents around asymmetric carbon atoms. Some amino acids contain more than one chirality center (e.g., the C3 atoms in isoleucine and threonine). Additional moieties may also contain asymmetric carbon atoms. The configuration around some asymmetric carbon atoms in biologically active molecules, such as acylglycerols or sugars, may be described in different languages. IUPAC recommends the use of absolute configuration (R or S) in systematic names of chiral compounds. Attention should thus be paid to the translation of the stereospecific numbering of acylglycerols (sn-1 and sn-3) or the nomenclature of anomeric carbon atoms in sugar residues (α or β) into the absolute configuration of asymmetric carbon atoms (atom sn-2 in acylglycerols or anomeric atoms in sugars). Users of such software should recognize its strong and weak points, as well as the possible errors (e.g., generation of structures containing atoms with inappropriate valence, missed chirality centers, or reversed configuration around asymmetric carbon atoms). When several programs are used, it is important to check that they work together correctly.
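One of the errors listed above, missed chirality centers, can be screened for by comparing the number of chirality centers expected from the amino acid sequence with the number of stereo-annotated atoms in the SMILES string. The following is a rough sketch assuming an unmodified peptide built from the 20 protein amino acids; `CHIRAL_CENTRES`, `expected_centres`, `annotated_centres`, and `centres_match` are hypothetical names. The check counts annotations only and cannot tell whether a configuration is reversed.

```python
import re

# Chirality centers per residue (one-letter code): glycine has none,
# isoleucine and threonine have two, the remaining protein amino acids one.
CHIRAL_CENTRES = {residue: 1 for residue in "ACDEFHKLMNPQRSVWY"}
CHIRAL_CENTRES.update({"G": 0, "I": 2, "T": 2})

# A bracket atom carrying a chirality mark, e.g. [C@H] or [C@@H].
STEREO_ATOM = re.compile(r"\[[^\]]*@[^\]]*\]")


def expected_centres(sequence):
    """Number of chirality centers expected for an unmodified peptide."""
    return sum(CHIRAL_CENTRES[residue] for residue in sequence.upper())


def annotated_centres(smiles):
    """Number of atoms in the SMILES string that carry a chirality mark."""
    return len(STEREO_ATOM.findall(smiles))


def centres_match(sequence, smiles):
    """True if the SMILES annotates as many centers as the sequence implies."""
    return expected_centres(sequence) == annotated_centres(smiles)


if __name__ == "__main__":
    # L-alanylglycine with and without the stereo annotation.
    print(centres_match("AG", "C[C@H](N)C(=O)NCC(=O)O"))
    print(centres_match("AG", "CC(N)C(=O)NCC(=O)O"))
```

Sugar or lipid moieties and non-protein amino acids add chirality centers that this table does not know about, so a mismatch for a modified peptide is a prompt for manual inspection rather than proof of an error.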
The training scheme involves the options and protocols described in the previous sections. The initial training dataset should contain simple compounds, mainly di- and tripeptides built from typical protein amino acids. They are usually annotated in general chemical databases, such as PubChem, ChemSpider, and ChEMBL, and in specialized peptide databases, such as BIOPEP or AHTPDB. Chemical codes generated during training may thus be easily verified. We confirm our previous recommendation [3] to consult several databases. Simple di- and tripeptides are usually well described, but representations of more complex moieties (e.g., sugars) may contain errors. These may be recognized and removed by comparing data from several databases. Sometimes, the creation of an appropriate structure representation requires a few attempts. For instance, any asymmetric carbon atom may occur in two possible configurations: R or S. The first attempt to obtain the correct one with the help of a molecular editor is successful with 50% likelihood (for one asymmetric carbon atom). The configuration of asymmetric carbon atoms is usually displayed in PubChem, or in names generated by the Chemical Identifier Resolver, although for sugar moieties the numbering of carbon atoms, which follows IUPAC recommendations designed for heterocyclic compounds, may be confusing.
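The 50% figure generalizes: with n independent asymmetric carbon atoms there are 2^n possible configurations, so a random first attempt is correct with likelihood (1/2)^n. Rather than redrawing the molecule repeatedly, all candidates can be listed at once and each compared with a database record or a generated name. The sketch below is an illustration under stated assumptions: `stereo_variants` is a hypothetical helper, it handles only the tetrahedral marks `@` and `@@` already present in the string, and it ignores double-bond stereochemistry.

```python
import itertools
import re

# Matches a tetrahedral chirality mark, "@" or "@@" (the longer form first).
STEREO_MARK = re.compile(r"@@?")


def stereo_variants(smiles):
    """Return every combination of @ / @@ marks for the annotated atoms.

    The input must already carry a mark on each chirality center;
    unmarked centers are not discovered by this function.
    """
    parts = STEREO_MARK.split(smiles)   # text between consecutive marks
    n_centres = len(parts) - 1
    variants = []
    for marks in itertools.product(("@", "@@"), repeat=n_centres):
        pieces = [parts[0]]
        for mark, tail in zip(marks, parts[1:]):
            pieces.append(mark + tail)
        variants.append("".join(pieces))
    return variants


if __name__ == "__main__":
    # L-alanyl-L-serine: two chirality centers, hence four candidates.
    candidates = stereo_variants("N[C@@H](C)C(=O)N[C@@H](CO)C(=O)O")
    for candidate in candidates:
        print(candidate)
    print("likelihood of a correct random guess:", 1 / len(candidates))
```

For a dipeptide with two centers the list has four entries, so an unguided attempt succeeds only one time in four, which is why checking the displayed configuration after each attempt is worthwhile.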