SeMPI 2.0—A Web Server for PKS and NRPS Predictions Combined with Metabolite Screening in Natural Product Databases

Microorganisms produce secondary metabolites with a remarkable range of bioactive properties. The constantly increasing amount of published genomic data provides the opportunity for efficient identification of biosynthetic gene clusters by genome mining. On the other hand, for many natural products with resolved structures, the encoding biosynthetic gene clusters have not been identified yet. Of those secondary metabolites, the scaffolds of nonribosomal peptides and polyketides (type I modular) can be predicted due to their building block-like assembly. SeMPI v2 provides a comprehensive prediction pipeline, which includes the screening of the scaffold in publicly available natural compound databases. The screening algorithm was designed to detect homologous structures even for partial, incomplete clusters. The pipeline allows linking of gene clusters to known natural products and therefore also provides a metric to estimate the novelty of the cluster if a matching scaffold cannot be found. Whereas currently available tools attempt to provide comprehensive information about a wide range of gene clusters, SeMPI v2 aims to focus on precise predictions. Therefore, the cluster detection algorithm, including building block generation and domain substrate prediction, was thoroughly refined and benchmarked, to provide high-quality scaffold predictions. In a benchmark based on 559 gene clusters, SeMPI v2 achieved comparable or better results than antiSMASH v5. Additionally, the SeMPI v2 web server provides features that can help to further investigate a submitted gene cluster, such as the incorporation of a genome browser, and the possibility to modify a predicted scaffold in a workbench before the database screening.


Introduction
Microorganisms, such as bacteria and fungi, have always been subject to evolutionary pressure. The law of "survival of the fittest" has led to a remarkable diversity of strategies to overcome competitors. At the level of biosynthesis, microorganisms "learned" to produce a vast number of natural products that help them to survive [1]. These secondary metabolites (SMs) often possess biological activities, which can be exploited for pharmaceutical purposes. The molecular machinery for the production of SMs is encoded in gene assemblies organized in biosynthetic gene clusters (BGCs). For most known gene clusters, the produced metabolite is unknown and various approaches have been explored for the prediction of SM scaffolds [2]. However, for most known SMs, the encoding BGC has not yet been discovered.
A first general approach in connecting SMs and BGCs was made with our SeMPI v1 web server [3]. The server combined predictions of modular polyketides (PKs) with Table 1. Cross-validation results and detection thresholds for biosynthetic domains used in the pipeline. The detection thresholds are referred to as domT by the HMMER manual [15].  1 Specific dehydratase which is more commonly found in trans-AT PKS [16]. 2 Avoids false positive pyran synthase domain detection [16].

PKS and NRPS Domain Architecture
To improve the existing PKS domain patterns for module detection and extend the detection scope to NRPS and PKS-NRPS mixed modules, the BGCs in the MIBIG v2 were observed and categorized. A summary of the extracted modules is shown in Figure 1 and the 20 most common modules are listed in Table 2. A complete list with all 212 module arrangements is shown in Supplementary Materials Table S1.

Adenylation Domain Dataset
In order to improve the performance of the A-domain specificity prediction models, a novel, very large training dataset was generated. Overall, 1028 new A-domains could be collected, leading to 2145 sequences altogether. An overview of the phylogenetic composi-Metabolites 2021, 11, 13 5 of 27 tion of the dataset is shown in Figure 2. The largest amount of A-domain sequences could be collected for proteobacteria (642) and actinobacteria (510). The dataset for actinobacteria, cyanobacteria, and proteobacteria could be more than doubled. The fungus dataset, represented mainly by ascomycota, could only be increased by 11 domains. The complete domain list is provided in Supplementary Table S2. KS-tAT_d-DH-KR-cMT-ACP S PKS Enoyl 49

Adenylation Domain Dataset
In order to improve the performance of the A-domain specificity prediction models, a novel, very large training dataset was generated. Overall, 1028 new A-domains could be collected, leading to 2145 sequences altogether. An overview of the phylogenetic composition of the dataset is shown in Figure 2. The largest amount of A-domain sequences could be collected for proteobacteria (642) and actinobacteria (510). The dataset for actinobacteria, cyanobacteria, and proteobacteria could be more than doubled. The fungus dataset, represented mainly by ascomycota, could only be increased by 11 domains. The complete domain list is provided in Supplementary Table S2.

Random Forest Parameter Grid Search
Different parameters for the RF models were evaluated using a grid search based on the cross-validation described in Section 4.4.2. The best parameters for the RF models for both domain types are shown in Table 3.

Random Forest Parameter Grid Search
Different parameters for the RF models were evaluated using a grid search based on the cross-validation described in Section 4.4.2. The best parameters for the RF models for both domain types are shown in Table 3.

Classifier Performance
The RF models with the best parameters determined in Section 2.3.2 were benchmarked for individual substrate prediction as well as their overall performance. The individual performances per substrate class for the AT-domains are shown in Table 4 and for the A-domains in Table 5. As a general trend, it can be observed that the performances improve with the number of training samples. For example, the F1-scores for methylmalonyl and malonly predictions are much higher compared to substrates with fewer samples such as ethylmalonyl and methoxymalonyl. A similar tendency can be observed for the NRPS substrates. However, there are also a few exceptions from this trend. For example, 3,5-dihydroxy-phenyl-glycin, with only 12 training samples, has an F1-score of 1, and phenylalanine, with 122 training samples, can only be predicted with an F1-score of 0.4. The different metrics for the overall performance of each classifier are shown in Table 6. Macro-averaged scores, which take all classes equally into account, show poorer performance as compared to micro-averaged scores, for PKS and NRPS substrates. The AT-domain classifier shows a very high performance, with an accuracy of 0.95. Even the MCC-score, which is more suited to representing unbalanced classifiers, has a score of 0.91 for the AT-domain classifier. The A-domain classifier only reaches an accuracy of 0.76 (MCC: 0.74), which is still an acceptable performance given the large amount of possible substrates.

Detection of Postsynthetic Modifications
Pfam domains were correlated to specific PSMs and used to build regression models for PSM prediction. Correlating Pfam domains are listed in Table 7. For most domains, the Pfam description matches the synthesized PSM. For example, the involvement of the pyridine nucleotide-disulphide oxidoreductase (Pfam ID: PF07992.13) in disulfide synthesis or the diaminopimelate epimerase (Pfam ID: PF01678.18) in nitro group synthesis can be explained with the function of the specific domains. In a few cases, highly correlating domains could be detected, where the involvement in PSM synthesis is not evident from the description. For example, the role of PF12680.6 and PF00890.23 in spiroketal synthesis or PF06722.11 in nitro group synthesis should be further investigated.
The Pearson correlation coefficient is rather average for all observed domains, with less than 0.67. Hence, most PSMs cannot be predicted based on linear correlation to one specific domain. Therefore, regression models that take multiple domains as features were generated to predict the number of PSMs. The performances of the regression models are shown in Table 8. The models were compared to hypothetical baseline models that compute the R2-score (coefficient of determination) for the case in which no PSM would be predicted for each sample BGC. The regression models could outperform the baseline models in all cases. In particular, the addition of glyco, spiroketal, and 6-Ring-formation PSMs could greatly improve the prediction scope. SeMPI v2 and antiSMASH v5 were benchmarked based on the Tanimoto similarity of the predicted scaffolds to the true SMs of all MIBiG v2 BGCs. The average (arithmetic mean) Tanimoto similarities for each BGC type are shown in Table 9. Different scoring metrics were combined to allow for the best comparison of natural compounds to the predicted scaffolds. The ranking benchmark shown in Section 2.5.3 was used to determine the optimal weight for the mixed ranking score. The best weights and corresponding benchmark performance for each BGC type are shown in Table 10. The parameter configuration which led to the best Top 10 ranking for all BGC types was chosen for the final mixed score. The ranking performance of SeMPI v2 and antiSMASH v5 predictions was benchmarked based on the ranking of the MIBiG compounds. The percentages of true SMs ranked into the Top 10 and Top 50 similar compounds are shown in Table 11. The mixed score is based on the grid search performed in Section 2.5.2. Since antiSMASH v5 does not predict PSMs, the mixed score can only be used for SeMPI v2 predictions. The ranking performance of SMs predicted from both pipelines can still be compared when the Tanimoto similarity-or MCS-based ranking is considered.

Web Server Design and Features
The SeMPI v2 web server is available at http://sempi.pharmazie.uni-freiburg.de and has processed hundreds of genomic input files without error (except for wrong input files). The processed files include 200 complete streptomyces genomes and all PKS, NRPS, and PKS-NRPS hybrid BGCs from the MIBiG, which were added to the server as a database feature. The server has been running continuously since 26 March 2019, without incidents. A detailed description of its features and usage is provided at the help page of the web server. A selection of screenshots is shown in Figure 3.

Profile-HMMs and Detection Thresholds
The overall high F-scores above 0.95 indicate that the MIBiG-derived pHMMs are appropriate tools for BGC detection. Only the ACP and bACP domains are wrongly classified in some cases, but since the role of ACP and bACP is identical for the SeMPI v2 module detection algorithm, the misclassification does not pose a problem for the overall scaffold generation. The DHt profile-HMM could be improved, but the detection error is still rather low and the domain is only seldom found in a BGC. The MIBiG v2 contains

Profile-HMMs and Detection Thresholds
The overall high F-scores above 0.95 indicate that the MIBiG-derived pHMMs are appropriate tools for BGC detection. Only the ACP and bACP domains are wrongly classified in some cases, but since the role of ACP and bACP is identical for the SeMPI v2 module detection algorithm, the misclassification does not pose a problem for the overall scaffold generation. The DHt profile-HMM could be improved, but the detection error is still rather low and the domain is only seldom found in a BGC. The MIBiG v2 contains considerably less DHt domains (133) compared to DH (1469) and DH2 (414) domains.
The detection thresholds (domT) were derived from the minimum scores needed to detect the true domain in the MIBiG. It seemed reasonable to use the curated MIBiG as a reference for domain detection. Nevertheless, the threshold should be frequently updated with newly added BGCs in the database. The only core profile-HMM which is not derived from this analysis is the KSQ profile since it was not separately labeled in the GenBank annotation of the MIBIG. The profile-HMM was directly taken from the antiSMASH v5 pipeline. The same threshold as for the KS domain was used.

PKS and NRPS Domain Architecture
To assign biosynthetic module arrangements with corresponding SM building blocks, a detailed investigation of the extracted modules was performed.

The Position and Role of the Methyltransferases in PKS
A detailed investigation of the carbon methyltransferase (cMT) containing modules revealed that the cMT domain can be found primarily in three possible arrangements in a PKS module. First, the cMT can be positioned between the reduction domains. For most modules containing reduction domains, the cMT was integrated at a conserved position Only one example of a trans-AT module with three reductive domains is described in the MIBiG v2 (BGC0001106). Since this module does not include a cMT domain, the correct cMT position for this arrangement could not be inferred. The characteristic cMT position could also be observed for putative trans-AT modules, where the trans-AT docking domain could not be detected (e.g., BGC0000177 [KS-DH-KR-cMT-ACP]), which could be a further marker for trans-AT detection.
The module arrangements described above were integrated into the module detection algorithm. The corresponding molecular scaffolds were derived by the addition of a methyl at the α-position of the building block. This could be observed for cis-AT [17][18][19] as well as trans-AT [18,20]

modules.
A special case is represented by the methylmalonyl substrate, where the building block is already α-methylated. In this case, a second α-methylation is also possible (e.g., in the case of Kirromycin [21]). However, the reduction to the double bond or alkyl is chemically not possible for this scaffold. This is confirmed by the BGC encoding for epothilone (BGC0000989 [22]). In the eighth module, the arrangement of DH-MT-KR leads to an α-methylation of the methylmalonyl substrate, but the reductive domains are inactive.
The oxygen methyltransferase (oMT) could be found predominantly (16 times) in modules that only contain a KR domain (e.g., BGC0000976 [KS-AT-oMT-KR-ACP]). In those cases, the keto-group in β-position is first reduced to a hydroxyl group and subsequently methylated. The oMT domain could also be observed in modules lacking reduction do-mains. Trans-AT modules (e.g., BGC0001110), as well as cis-AT modules (e.g., BGC0000982) without reduction modules, have been observed harboring an oMT domain. Although biosynthesis routes for BGCs with these module types have been proposed (cis-AT [23] and trans-AT [11]), the function of the oMT domain in these modules is too specialized and the proposed biosynthesis routes are only hypothetically described in publications. Therefore, a general rule for the transformation performed by these module types was not included in the module detection algorithm. A summary of all PKS module arrangements detected by SeMPI v2 and corresponding building blocks is shown in Figure 4.

The Position and Role of the Methyltransferases in NRPS
The most prominent MT found in NRPS modules is the N-methylating type. The arrangement C-A-nMT-PCP can be observed 120 times in the MIBiG, being the most commonly occurring methylation. The nMT is located between the A and PCP domain in all observed cases, e.g., BGC0001971 (C-A-nMT-PCP), BGC0000461 (C-A-nMT-PCP-TE), BGC0000384 (A-nMT-PCP), and BGC0000326 (C-A-nMT-PCP-E). The nMT domain usually leads to methylation of the nitrogen in the peptide bond of the inserted building block (e.g., retimycin and rakicidin A [24]).
In very rare cases, a C-methyltransferase (cMT) is also observed (C-A-cMT-PCP-TE). This cMT is responsible for the methylation of the α-carbon of a thioproline building block in thiazostatin-like metabolites (e.g., BGC0001014, BGC0000324, and BGC0001801). This reaction is too specific to infer a general biosynthetic rule.

NRPS-PKS Hybrid Modules
A detailed investigation of PKS-NRPS mixed modules showed that, in all cases, the ACP/PCP domain led to the classification as a mixed module (e.g., C-A-ACP or KS-AT-KR-PCP). These domains are structurally related and share similar tasks, namely the transport of the building blocks. Therefore, it is probable that mixed forms exist which cannot be reasonably distinguished with pHMMs. In some cases, they might be wrongly detected (e.g., BGC0001699). In other cases, especially in PKS-NRPS mixed clusters, the ACP/PCP domain could represent the link between the PKS and the NRPS part of the cluster (e.g., BGC0000459). In regard to the building block generation, this domain swap can be disregarded, since the substrate carrying domains do not lead to modifications of the building block.

Scaffold Generation
Based on the module analysis, detection rules for module arrangements with known substrates were derived. For each module, a building block is generated using SMILES.

The Position and Role of the Methyltransferases in NRPS
The most prominent MT found in NRPS modules is the N-methylating type. The arrangement C-A-nMT-PCP can be observed 120 times in the MIBiG, being the most commonly occurring methylation. The nMT is located between the A and PCP domain in all observed cases, e.g., BGC0001971 (C-A-nMT-PCP), BGC0000461 (C-A-nMT-PCP-TE), BGC0000384 (A-nMT-PCP), and BGC0000326 (C-A-nMT-PCP-E). The nMT domain usually leads to methylation of the nitrogen in the peptide bond of the inserted building block (e.g., retimycin and rakicidin A [24]).
In very rare cases, a C-methyltransferase (cMT) is also observed (C-A-cMT-PCP-TE). This cMT is responsible for the methylation of the α-carbon of a thioproline building block in thiazostatin-like metabolites (e.g., BGC0001014, BGC0000324, and BGC0001801). This reaction is too specific to infer a general biosynthetic rule.

NRPS-PKS Hybrid Modules
A detailed investigation of PKS-NRPS mixed modules showed that, in all cases, the ACP/PCP domain led to the classification as a mixed module (e.g., C-A-ACP or KS-AT-KR-PCP). These domains are structurally related and share similar tasks, namely the transport of the building blocks. Therefore, it is probable that mixed forms exist which cannot be reasonably distinguished with pHMMs. In some cases, they might be wrongly detected (e.g., BGC0001699). In other cases, especially in PKS-NRPS mixed clusters, the ACP/PCP domain could represent the link between the PKS and the NRPS part of the cluster (e.g., BGC0000459). In regard to the building block generation, this domain swap can be disregarded, since the substrate carrying domains do not lead to modifications of the building block.

Scaffold Generation
Based on the module analysis, detection rules for module arrangements with known substrates were derived. For each module, a building block is generated using SMILES. The building block represents the attachment of two carbon atoms to the growing metabolite in the case of PKs. The positions are labeled as described in the article by Keatinge-Clay [25] and are shown in Figure 5. NRP building blocks add a peptide bond and a carbon atom to the growing chain. Analogous to PK, the NRP positions are referred to as β for the peptide bond and α for the added carbon with the side chain.
Metabolites 2021, 11, x FOR PEER REVIEW 14 of 28 Figure 5. Nomenclature for PK and NRP building blocks. R1 in alpha position is defined by the substrate choice of the AT-or A-domain. The α-position of PK substrates can be further modified co-synthetically by C-methyltransferases. The stereo configuration of the α-carbon of NRP substrates is defined by the presence of an E domain. The β-position of PK substrates can be further reduced. For NRP substrates, only the nitrogen atom, also referred to as β-position, can be methylated.
The PKS reduction pattern and putative methylation are created as described in Figure 4. For the building block generation, it is important to consider that the reductive domains act on the previous building block (not the one activated in the observed module) and therefore this modification needs to be assigned accordingly. There are two possible NRP modifications. First, the stereo configuration of the α-atom can be modified depending on the presence of an epimerization domain. Second, the nitrogen of the peptide bond can be methylated.
PKS loading models are defined by the presence of a KSQ domain or the absence of a KS domain. NRPS loading modules are defined by the absence of a condensation domain. Building blocks of loading modules are missing the carbon atom in β-position. Terminal modules for both cluster types are defined by the presence of a TE or TD domain. Additional to the β-and α-position, these modules add a carboxyl group (TE) or a hydroxyl group (TD) to the building block.
For each complete block, the building blocks are joined together to generate the final metabolite. An example of the scaffold generation for a hypothetical mixed cluster is shown in Figure 6. The PKS reduction pattern and putative methylation are created as described in Figure 4. For the building block generation, it is important to consider that the reductive domains act on the previous building block (not the one activated in the observed module) and therefore this modification needs to be assigned accordingly. There are two possible NRP modifications. First, the stereo configuration of the α-atom can be modified depending on the presence of an epimerization domain. Second, the nitrogen of the peptide bond can be methylated.
PKS loading models are defined by the presence of a KSQ domain or the absence of a KS domain. NRPS loading modules are defined by the absence of a condensation domain. Building blocks of loading modules are missing the carbon atom in β-position. Terminal modules for both cluster types are defined by the presence of a TE or TD domain. Additional to the βand α-position, these modules add a carboxyl group (TE) or a hydroxyl group (TD) to the building block.
For each complete block, the building blocks are joined together to generate the final metabolite. An example of the scaffold generation for a hypothetical mixed cluster is shown in Figure 6.

PKS and NRPS Module Order
The correct order of modules in clusters that deviate from the co-linearity principle is difficult to predict. In some cases, the correct order can be derived by arranging starting modules at the beginning of a block and terminal modules at the end. However, for clusters composed of multiple disjointed modules, this simple approach does not suffice.
Although docking domains for PKS allow one to infer the order of modules in some cases, in many clusters, these domains are missing or cannot be identified with the existing profile-HMMs. In the study by Yadav et al. [26], 17 clusters were used to demonstrate the possibility of predicting the module order based on docking domains. A larger study has not been performed yet. Moreover, additional structurally different docking domains (class 2) have been described [27], further complicating the prediction of PKS module order. To the best of our knowledge, for NRPS and hybrid BGCs, no such study has been performed yet.
Since clear ab-initio rules are missing, SeMPI v2 infers the order of modules by comparing the overall domain architectures with known clusters. This approach could lead to correct module compositions for various clusters including nigericin (BGC0000114), incednine (BGC0000078), fluvirucin B2 (BGC000157), and even meilingmycin (BGC0000093)-which is composed of a very complex composition of six disjointed blocks. A final assessment of the method would require a large selection of disjointed BGCs, where the correct order is derived from the literature. Since such a collection is not yet available, the ordering algorithm was measured indirectly based on the quality of the predicted scaffolds (see Sections 2.5.1 and 2.5.3).

Adenylation and Acyltransferase Substrate Specificity
In the last decade, various methods have been proposed for the prediction of the substrate specificity of AT- [28][29][30][31][32] and A-domains [30,31,[33][34][35][36]. For each study, different datasets with known substrates have been collected and used to implement a prediction algorithm or train machine learning models. Although the studies tried to compare their predictive accuracy against each other, in most cases, this comparison is of little significance, since the used datasets are not identical. Different cross-validation methodologies

PKS and NRPS Module Order
The correct order of modules in clusters that deviate from the co-linearity principle is difficult to predict. In some cases, the correct order can be derived by arranging starting modules at the beginning of a block and terminal modules at the end. However, for clusters composed of multiple disjointed modules, this simple approach does not suffice.
Although docking domains for PKS allow one to infer the order of modules in some cases, in many clusters, these domains are missing or cannot be identified with the existing profile-HMMs. In the study by Yadav et al. [26], 17 clusters were used to demonstrate the possibility of predicting the module order based on docking domains. A larger study has not been performed yet. Moreover, additional structurally different docking domains (class 2) have been described [27], further complicating the prediction of PKS module order. To the best of our knowledge, for NRPS and hybrid BGCs, no such study has been performed yet.
Since clear ab-initio rules are missing, SeMPI v2 infers the order of modules by comparing the overall domain architectures with known clusters. This approach could lead to correct module compositions for various clusters including nigericin (BGC0000114), incednine (BGC0000078), fluvirucin B2 (BGC000157), and even meilingmycin (BGC0000093)-which is composed of a very complex composition of six disjointed blocks. A final assessment of the method would require a large selection of disjointed BGCs, where the correct order is derived from the literature. Since such a collection is not yet available, the ordering algorithm was measured indirectly based on the quality of the predicted scaffolds (see Sections 2.5.1 and 2.5.3).

Adenylation and Acyltransferase Substrate Specificity
In the last decade, various methods have been proposed for the prediction of the substrate specificity of AT- [28][29][30][31][32] and A-domains [30,31,[33][34][35][36]. For each study, different datasets with known substrates have been collected and used to implement a prediction algorithm or train machine learning models. Although the studies tried to compare their predictive accuracy against each other, in most cases, this comparison is of little significance, since the used datasets are not identical. Different cross-validation methodologies and classification metrics further hinder the comparability of substrate classification models.
Despite these difficulties, it could be observed that most substrate prediction tools reach rather similar performances, especially concerning the most common substrates.
For AT-domain substrate predictions with a rather small substrate choice, the accuracy reaches more than 90%. For example, Khayatt [37]. A-domain predictions which are more challenging-due to their large substrate rangereached F1-scores (micro or macro) between 0.7 and 0.8. The latest study, SANDPUMA [36], could demonstrate that various methods achieve this performance. However, it should be mentioned that the demonstrated ensemble method is slightly better than the individual classifiers.
The RF models used for substrate prediction in SeMPI v2 yielded comparable performances with the aforementioned state-of-the-art tools. The AT substrates could be predicted with an accuracy of 95% and the A-domain specificity classification reached a micro F1-score of 0.76.
The observation of a performance peak in substrate predictability allows the assumption that further improvements should be focused on the training dataset, rather than on the used algorithm. It can be hypothesized that various algorithm-independent parameters are responsible for the difficulty of substrate specificity classification. First, substrate variability hinders the predictability of one outcome class [38]. Second, substrate classes with few known instances are difficult to predict. Finally, since the training datasets are manually collected, wrongly assigned specificities can further impair the predictive power.
Therefore, a strong focus was placed on increasing the A-domain training dataset for the SeMPI v2 classifier. The dataset collected by Prieto et al. [34] comprised the largest collection of annotated A-domain sequences (1598). Surprisingly, even the dataset used for the SANDPUMA algorithm-which is composed of the dataset collected by Khayatt et al. (494) and newly annotated sequences (434)-was smaller than the Prieto dataset (928). Consequently, we decided to use the Prieto dataset as a basis and added manually annotated sequences, which allowed us to generate the largest dataset of annotated Adomain sequences of overall 2145 samples.

Detection of PSMs
PSMs are more challenging to predict than the rather straightforward linear assembly lines of the PKS and NRPS core modules. This is partly because the involved domains are rather arbitrarily located in proximity to the BGC. However, the main reason can be attributed to missing learning data. Although hundreds of BGCs and their SMs are stored in the MIBiG, only a few instances of each PSM can be found-the only exception being carbohydrate scaffolds, where various examples are known, but the type and number of glycosides vary greatly between clusters.
Therefore, the scope of PSMs in SeMPI v2 is mainly targeted by the subsequent database screening. Nevertheless, for some PSMs where at least a few instances are described, a prediction was attempted. Eventually, the addition of PSMs into the screening process can bring researchers closer to the true product of a BGC. The correlation analysis yielded various Pfam domains that were positively correlated to a given PSM. The regression models built based on these domains could predict the number of PSMs for a cluster with a high R 2 -score.
For carbohydrate scaffolds, only the number of glycosides is predicted since the prediction of the glycoside type depends on complex domain arrangements. To the best of our knowledge, only PRISM v3 [39] provides a combinatorial approach for sugar prediction, although a detailed benchmark of the method is missing.

Similarity of the Prediction to the True Secondary Metabolite
The comparison of the predicted scaffolds with the true metabolite shows that the predictions made by SeMPI v2 are slightly better compared to the antiSMASH v5 predictions. The best predictions are made for NRPS products (mean Tanimoto similarity: 0.57), which shows that the A-domain specificity classification of SeMPI v2 can compete with the SANDPUMA algorithm implemented in antiSMASH v5.
The rather average Tanimoto similarity between prediction and true SM for both tools can be seen as a general indicator of how much structural information is still missing in the prediction scope. Some of this structural information gap can potentially be completed by comparison of the prediction with similar, already described compounds.

Ranking of the True Metabolite
For the NRPS products, the MCS-based ranking yielded rather similar performance for antiSMASH v5 (Top 10: 45.9%) and SeMPI (Top 10: 45.1%) predictions. However, for the PK dataset, the MCS-based scoring performed much better for SeMPI v2 predictions (Top 10: 36.0%) as compared to antiSMASH v5 (Top 10: 22.5%) predictions. Since the similarity benchmark did not show such a discrepancy, this observation was further investigated.
The careful observation of various PK predictions showed that antiSMASH v5 predictions have an offset concerning the reduction patterns of PK products. Furthermore, in the observed scaffolds, the double bonds were assigned at the wrong position. A demonstration based on the abyssomicin BGC is shown in Supplementary Figure S1. This wrong assignment led to false matching in the MCS algorithm for antiSMASH v5 predictions. This observation could demonstrate the advantage of the MCS-based ranking as opposed to Tanimoto similarity-based ranking. To further increase the ranking power, the three ranking scores were combined to a mixed score, which yielded the overall best possible ranking. This mixed score ranked 47% of the true SM into the Top 10.
The NRPS-PKS hybrid BGCs led to the poorest rankings in the benchmark. This is not surprising, since NRPS-PKS mixed BGCs can lead to very complex scaffolds, which are difficult to match with the linear predictions. Nevertheless, 40% of the mixed products could be ranked in the Top 10. Considering the Top 50 ranked compounds, which is the default output on the web server, 66% of the true SMs were matched.
The ranking benchmark demonstrates that researchers aiming to identify BGC products can gain significant support by using the SeMPI v2 database screening approach.

PKS and NRPS Profile-HMMs
The widely used domain annotations from the MIBIG database v2 [14] were taken as a basis for constructing new profile-HMMs for the SeMPI v2 pipeline. The domain names were based on the commonly used antiSMASH [4,11,16,40,41] nomenclature.
To evaluate the detection sensitivity of the created profiles, a benchmark was designed where the profiles were 5-fold cross-validated. Precision, recall, F1-score, and the support for each profile are reported. Additionally, the detection threshold (domain-specific e-value as reported by the HMMER software [15,42], referred to as domT) was adjusted for each profile-HMM. The minimum threshold required for the detection of the correct domains in the MIBIG v2 was used for each profile-HMM in the SeMPI v2 pipeline.

PKS and NRPS Module Architecture
Domains that are jointly responsible for the incorporation and co-modification of a new building block in the synthesis of a SM are defined as modules. To systematically investigate known module architectures and their associated building blocks, NRPS, PKS, and PKS-NRPS hybrid modules were extracted from the MIBIG v2 database. Specific properties which allowed us to examine and categorize the modules were computed: the number of occurrences of the module in the MIBiG, the MIBiG accession IDs, the order of the domains, the occurrence of methylation domains, the reduction or epimerization profile, and the possibility of a trans-acting AT.
The modules were categorized into three groups. First, the normal modules comprised domain sets for which complete building blocks could be derived. These modules contained either an A-domain or an AT-domain, allowing classification of the substrate and co-synthetic modifications of the building block. For the PKS system, modifications comprised the reduction profile as well as possible methylation of the substrate. For the NRPS system, the stereo-configuration of the α-carbon was determined as well as possible methylations. Additionally, for releasing modules, the possible reduction of the carbonyl group was determined.
Second, the special modules were defined as modules for which building blocks could only be partly defined. For these modules, either the substrate specificity could not be predicted (e.g., trans-AT) or the co-synthetic modifications could not be inferred (e.g., PKS modules with an unusual reduction domain arrangement, such as KS-AT-DH-ACP).
Third, the non-functional modules comprised all domain combinations which did not allow us to determine a possible building block, because crucial domains for the function of the module were missing (e.g., DH-ACP, C-PCP, KR-ACP).
Based on the functional and special modules, patterns were derived, which can be used to screen novel BGCs for known modules.

PKS and NRPS Module Order
An algorithm was designed which allows us to score possible module combinations in a BGC based on known BGC arrangements.

Database of Clusters with Known Module Order
The MIBiG v2 BGCs modules were combined into blocks. Each block represented a unit in which the modules were arranged in succession. Only modules that were located directly next to each other on the same DNA strand were combined into blocks. The blocks were split up if a terminal module (harboring a TE or TD domain) was not located at the end position of the block.
Those BGCs in the MIBiG v2 database, where exactly one block could be defined, were taken as reference BGCs. The reference database comprised 252 BGCs.

Module Order Scoring for Novel Clusters
For a novel BGC, the modules are ordered into blocks. For all blocks in a BGC, an allvs.-all matrix is computed. For each pair in the matrix, an artificial domain set is generated, representing hypothetical combined blocks. Only reasonable block combinations are considered. For example, starting domains can only be at the beginning of a block pair and terminal domains can only be at the end of a pair. The generated domain combinations are aligned with the reference BGCs and scored.
The alignment is performed using the Smith-Waterman algorithm [43] for local alignments. The parameters are matched domains yield 1 point, unmatched domains cost 1 point, gap opening costs 2 points, and gap extension costs 1 point. The block pairs which yield the highest scores are combined into the final blocks. If not all blocks can be combined (e.g., if two terminal domains are present), the blocks are left disconnected. The module order algorithm is demonstrated in Figure 7.

Sequence Datasets
Sequences of PKS AT-domains with known substrate specificity were taken as described in SeMPI v1 [3]. The initial set of annotated NRPS A-domain sequences was taken from Prieto et al. [34]. This dataset comprised 1598 annotated A-domains. The dataset was used to generate a profile-HMM [15,42,44]. The profile-HMM was used to screen the entire MIBiG v2 [14] and collect a novel set of A-domain sequences. The sequences collected by Prieto et al. and the newly collected sequences were aligned using MAFFT (for multiple alignment using fast Fourier transform) [45]. Duplicated sequences with an identity of 100% were removed. Substrate specificities for the new sequences were inferred from the scientific literature of described metabolites.
For both sequence datasets, only sequences where at least 10 known specificities could be assigned were used. The A-domain dataset was further trimmed to only include unambiguous specificities. The final datasets comprised 500 AT-domains and 2145 A-domains. The A-domains were furthermore phylogenetically categorized. Therefore, the taxonomy for each sequence was derived from UniProt [46].

Specificity Prediction
The sequences were used to train random forest (RF) classifiers. The sequences were aligned using MAFFT. The alignments were transformed into an array that could be used as a feature set to train RF models. Each position in the alignment can be interpreted as a categorical feature. The RF models were implemented using scikit-learn [47]. The scikitlearn models require binary features; hence, the categorical feature space was transformed into binary features using one-hot encoding (also referred to as dummy encoding). The encoding scheme and model building are visualized in Figure 8.

Sequence Datasets
Sequences of PKS AT-domains with known substrate specificity were taken as described in SeMPI v1 [3]. The initial set of annotated NRPS A-domain sequences was taken from Prieto et al. [34]. This dataset comprised 1598 annotated A-domains. The dataset was used to generate a profile-HMM [15,42,44]. The profile-HMM was used to screen the entire MIBiG v2 [14] and collect a novel set of A-domain sequences. The sequences collected by Prieto et al. and the newly collected sequences were aligned using MAFFT (for multiple alignment using fast Fourier transform) [45]. Duplicated sequences with an identity of 100% were removed. Substrate specificities for the new sequences were inferred from the scientific literature of described metabolites.
For both sequence datasets, only sequences where at least 10 known specificities could be assigned were used. The A-domain dataset was further trimmed to only include unambiguous specificities. The final datasets comprised 500 AT-domains and 2145 Adomains. The A-domains were furthermore phylogenetically categorized. Therefore, the taxonomy for each sequence was derived from UniProt [46].

Specificity Prediction
The sequences were used to train random forest (RF) classifiers. The sequences were aligned using MAFFT. The alignments were transformed into an array that could be used as a feature set to train RF models. Each position in the alignment can be interpreted as a categorical feature. The RF models were implemented using scikit-learn [47]. The scikitlearn models require binary features; hence, the categorical feature space was transformed into binary features using one-hot encoding (also referred to as dummy encoding). The encoding scheme and model building are visualized in Figure 8. To determine the best parameters for the RF classifiers, an extensive parameter grid search was performed. The parameters are listed in Table 12. All parameter combinations resulted in overall 2160 configurations. For each parameter combination, a model was built and evaluated using 10-fold cross-validation. The model performance was evaluated using the micro F1-score. For the models with the best configuration, the performance for individual substrates was further evaluated. To allow comparison with other methods, various metrics for the cross-validation were computed: accuracy, precision, recall, F1score, area under the receiver operating characteristic curve (ROC-AUC), error rate, and Matthews correlation coefficient (MCC). Table 12. RF parameters and their configuration used for the grid search to determine the best configuration for substrate specificity prediction. Each parameter is described in detail in the scikit-learn documentation [48]. Precision, recall, and F1-score were computed individually for each class. To evaluate the power of the entire model, the metrics were also averaged using micro-, macro-, and weighted-averaging. Similarly, micro-and macro-AUC scores were computed.

Detection of Postsynthetic Modifications
The GenBank annotation of the MIBIG database v2 was used to extract the number of non-overlapping Pfam domains present in each BGC. The scaffolds of the produced SMs were screened for putative PSMs using the substructure search as implemented in RDKit [49]. SMARTS patterns used to match the PSMs are shown in Table 13. To determine the best parameters for the RF classifiers, an extensive parameter grid search was performed. The parameters are listed in Table 12. All parameter combinations resulted in overall 2160 configurations. For each parameter combination, a model was built and evaluated using 10-fold cross-validation. The model performance was evaluated using the micro F1-score. For the models with the best configuration, the performance for individual substrates was further evaluated. To allow comparison with other methods, various metrics for the cross-validation were computed: accuracy, precision, recall, F1score, area under the receiver operating characteristic curve (ROC-AUC), error rate, and Matthews correlation coefficient (MCC). Table 12. RF parameters and their configuration used for the grid search to determine the best configuration for substrate specificity prediction. Each parameter is described in detail in the scikitlearn documentation [48].

Parameter Configurations
Max Precision, recall, and F1-score were computed individually for each class. To evaluate the power of the entire model, the metrics were also averaged using micro-, macro-, and weighted-averaging. Similarly, micro-and macro-AUC scores were computed.

Detection of Postsynthetic Modifications
The GenBank annotation of the MIBIG database v2 was used to extract the number of non-overlapping Pfam domains present in each BGC. The scaffolds of the produced SMs were screened for putative PSMs using the substructure search as implemented in RDKit [49]. SMARTS patterns used to match the PSMs are shown in Table 13. The number of Pfam domains per BGC was correlated with the number of substructures in the related metabolites using the Pearson correlation coefficient. The positively correlating domains were further investigated to derive their function in the PSM. The investigation was based on the Pfam description. For example, domains that correlated with sugar scaffolds were examined for sugar biosynthesis association, such as glycosyltransferase activity. The selected domains were used to build logistic regression models with scikit-learn. The models were evaluated with 5-fold cross-validation. The performance was measured using the R 2 -score (coefficient of determination) [50]. As a baseline, the R 2 -score was also computed for a hypothetical model, assuming no PSMs are predicted at all. The models which performed better than the baseline were incorporated into the prediction pipeline. The predicted PSMs were added as additional features to the predicted scaffold.  [4], NANPDB [5], ChEBI [6], and Drugbank [7] molecules were retrieved via their web interface in SDF format. ChEBI contains manually curated and automatically collected molecules. Only the manually curated ones were taken. The MIBiG v2 molecules were extracted from JSON files downloaded via the web interface. Molport molecules were retrieved from the Molport FTP server [9]. Only those molecules which were marked as natural products (NPs) in the database were taken. Analyticon [8] molecules were retrieved via their ZINC [51] deposit. Norine [10,52,53] peptides were downloaded via the REST API of the web interface.
Furthermore, a collection of streptomyces genomes were processed using SeMPI v2, and the results were stored as a database feature on the web server. The predicted BGC products were also included as a possible natural compound screening option. This option allows one to detect the putative BGC for a given streptomyces natural product. The streptomyces genomes included all complete genomes associated with compounds stored in the StreptomeDB v3 [54]. Whereas the database screening allows for the detection of putative SMs for a submitted BGC, the addition of preprocessed BGCs also enables reverse screening, thus allowing one to connect SMs with their putative BGCs. A summary of the databases is shown in Table 14.
The molecules were parsed using RDKit and incorporated into a PostgreSQL [55] database. Duplicated SMILES were removed from the database. The SMILES were converted into circular Morgan fingerprints (radius: 3) using the RDKit database cartridge. The fingerprints were indexed using the Generalized Search Tree (GiST) [56] algorithm. The database set-up achieves a Tanimoto similarity search of the predicted scaffolds with more than 180,000 molecules in less than a second. The Tanimoto similarity search allows the selection of a group of closely related scaffolds to the query prediction. However, especially repeating patterns such as PK and NRP scaffolds do not allow for good scaffold comparison using molecular fingerprint-based algorithms. Therefore, an additional most common substructure (MCS)-based scoring algorithm was implemented. The MCS algorithm scores the predicted molecules with the compounds in the database, by detection of a MCS. The score is computed by dividing each matching atom and bond in the predicted molecule and the target molecule by the total number of atoms and bonds. Because the MCS algorithm is very time-consuming, it is by default only performed on a preselection of 50 compounds from the similarity search. The RDKit implementation only allows the generation of the MCS between one query and one target molecule. To allow for the matching of incomplete BGCs, the algorithm was extended to allow for the matching of multiple query molecules (e.g., multiple disconnected building blocks) in the natural compound targets. The extended MCS scoring algorithm is demonstrated in Figure 9.
A third scoring algorithm was implemented to cover the influence of PSMs on the ranking of the metabolites. This scoring function performs a substructure search using RDKit for the predicted PSMs on the preselected natural compounds. The match between the PSMs in the predicted scaffold and the natural compound is computed using the Bray-Curtis similarity [57]. The Bray-Curtis similarity is defined as: where u is the array of PSMs found in the predicted compound and v is the array of PSMs found in the natural compound from the database. This similarity score was chosen since the interpretation is rather intuitive: if both molecules have the same number of PSMs, the score is 1; the greater the distinction, the closer the score is to 0.
The scores are used to rank the natural products to present a selection of possible complete SMs that might be produced by the observed BGC. For the final ranking, a mixed score was designed where the individual scores are added with different weights: Mixed-score (a,b,c) = a × Similarity-score + b × MCS-score + c × PSM-score. ( To determine the weight of each score to allow for the best possible ranking performance, a grid search was performed. Each score was weighted from 0 to 100% in steps of 10%, leading to overall 3992 parameter configurations. The performance of each parameter configuration was measured using the MIBiG-based ranking benchmark. the predicted molecule and the target molecule by the total number of atoms and bonds. Because the MCS algorithm is very time-consuming, it is by default only performed on a preselection of 50 compounds from the similarity search. The RDKit implementation only allows the generation of the MCS between one query and one target molecule. To allow for the matching of incomplete BGCs, the algorithm was extended to allow for the matching of multiple query molecules (e.g., multiple disconnected building blocks) in the natural compound targets. The extended MCS scoring algorithm is demonstrated in Figure 9.

Secondary Metabolite Prediction and Ranking Benchmark
Following the benchmark for SeMPI v1 [3], where the ranking of 40 BGC products was used as a metric to evaluate the pipeline performance, a novel benchmark was designed. For this benchmark, all PKS, NRPS, and PKS-NRPS hybrid BGCs in the MIBiG v2 with known products were collected. A total of 559 clusters were evaluated.
Two metrics were defined: First, the Tanimoto similarity (circular Morgan fingerprint, radius: 3) of the predicted scaffold with the known product of each cluster was computed. Second, all MIBiG v2 products of the benchmark BCGs were ranked based on the scoring metrics defined in Section 4.6.2. On the web server, the MCS-based and mixed-score-based rankings are only computed for preselected compounds, but for the benchmark, the scores were computed for all metabolites. In total, 828 metabolites were ranked, which is more than the total number of BGCs in the benchmark, since for some clusters, multiple possible products are described (e.g., abyssomicin C and atrop-abyssomicin C for BGC0000001). The rank of the true SM was used as a metric. Since the pipeline reports the Top 10 and Top 50 matching candidates, the percentage of true SMs ranked in this category was calculated.
To the best of our knowledge, the only other pipeline which produces mostly one predicted scaffold per BGC and is actively maintained is antiSMASH v5. We used the predicted scaffolds of antiSMASH v5 [16] as a comparison to our pipeline. For the natural compound ranking metric, the scoring scheme had to be adjusted, since antiSMASH v5 does not predict PSMs. Therefore, the comparative benchmark was performed on the predicted scaffolds without PSM ranking. The number of predicted BGCs for both pipelines was not the same for each cluster type. Multiple factors contributed to this discrepancy. First, the cluster type classification differs between pipelines. For example, SeMPI v2 classifies the clusters only based on the presence of functional modules, whereas we assume that the antiSMASH v5 classification is based on any biosynthetic domain found in the BGC. Second, in some rare cases, antiSMASH v5 would split the cluster query into multiple BGCs. These BGCs were excluded. Third, in some cases, either antiSMASH v5 or SeMPI v2 did not predict scaffolds. Nevertheless, for most BGCs, scaffolds were predicted by both pipelines. To show a fair comparison, a dataset was generated in which the predictions of both pipelines were merged. BGCs which were not predicted by either tool were excluded from this dataset. The cluster type classification from SeMPI v2 was used for all predictions. For some BGCs, antiSMASH v5 produced multiple cluster variants-for these instances, the mean of the similarity was taken.
Other tools are difficult to compare to our pipeline since the output characteristics are significantly different and hinder comparability. SBSPKS v2 [58] only predicts the products for domains encoded by one gene, not for entire clusters. Moreover, SBSPKS v2 can only process a maximum of 10 proteins via its web interface, which makes the processing of over 500 clusters very time-consuming. PRISM v3 produces a collection of scaffold permutations for most clusters. Our benchmark was designed to score single predicted scaffolds; hence, PRISM v3 was not used as a comparison. Moreover, we experienced various technical problems when submitting multiple BGCs in FASTA format to the PRISM v3 pipeline.
The benchmark was also used to determine the optimal weight of the three different scoring metrics. Therefore, a grid search was performed, where each of the scores was weighted differently.

Pipeline and Web Server Implementation
SeMPI v1 was designed as an extension to antiSMASH v3 [11], with the main goal of predicting PKs with deviations from the co-linearity principle as well as implementing a database screening in the StreptomeDB v2.0. To create an independent and flexible tool, SeMPI v2 was completely redesigned. SeMPI v2 performs the entire cluster prediction and scaffold generation independently.

Input
The input must be provided in FASTA or GenBank format. The files can contain multiple records. SeMPI creates one result page for each record. If only DNA data are provided, the genes are predicted using Prodigal [59]. If the genes are already assigned, SeMPI can try to parse the genes and use them for further analysis.

Cluster Prediction
The biosynthetic domains are detected using the profile-HMMs and thresholds described in Section 2.1. The proteins and their characteristics-such as position, gene locus, and strand direction-are transferred into a Pandas [60] DataFrame. The DataFrame allows very fast vectorized operations. The processing of an entire genome including database screening for similar natural compounds takes only a few minutes.
The modules are detected based on the rules described in Section 3.1. The most probable order of modules is generated using the module order algorithm described in Section 4.3. The substrates for each module are predicted using the RF models for ATand A-domains. Based on the module characteristics and the substrate specificity, the corresponding building blocks are generated. These building blocks are combined into the putative metabolite scaffolds.
For each scaffold, a set of PSMs is generated using the profile-HMMs and regression models as described in Section 4.5.

Metabolite Ranking
The predicted scaffolds are submitted to the database screening. The screening is performed using a user-selected choice of publicly available databases and/or a custom uploaded database (SDF or SMILES format). A custom database could be, for example, a set of bioactive samples that are assumed to be produced by a specific host. The screening can help to connect the samples with specific BGCs in the host.
The predicted scaffolds can also be modified and resubmitted to the natural compound screening. The molecular operations are performed using the JSME Molecule Editor [61]. This allows expert users to add certain modifications and substructures which are not yet correctly predicted. The database screening can also be used independently of the cluster prediction. Any molecule can be submitted, modified, and resubmitted via the web interface.
This feature also allows the usage of the SeMPI v2 database screening backend in combination with other BGC prediction tools such as PRISM v3 and antiSMASH v5.

Web Server
The web server is built using Django [62] and a PostgreSQL database backend. Genomes are submitted via a simple input form. A real-time view shows the progress of the pipeline using Ajax [63,64]. The results are presented in the form of a data table [65]. The results include a summary page with detailed information about the domains and modules used to generate the clusters and an overview of the scaffolds for each BGC. Each BGC and also the entire genome can be investigated visually via the genome browser D3GB [66]. The browser shows the genes, domains, modules, blocks, and clusters.

Conclusions
The SeMPI v2 web server combines state-of-the-art predictions of PK and NRP scaffolds with a specialized screening in publicly available databases. This allows researchers to further connect the dots between BGCs and SMs. The information generated with SeMPI v2 is presented in a user-friendly web server frontend. The web server allows further customization of the predictions and thus more flexibility for expert users. BGCs can be observed in detail using the visually appealing integrated genome browser.
Since SeMPI v2 yields predictions that are complementary to existing pipelines, a comprehensive BGC analysis should include all available resources, such as antiSMASH v5, PRISM v3, SBSPKS v2, and others. Expert users can combine the generated structural features and submit multiple queries to the SeMPI v2 database screening backend to search for natural compound targets in different databases. In this role, SeMPI v2 contributes to further unraveling BGC mysteries and complements the link to putative SMs.

Data Availability Statement:
The data presented in this study is available in the supplementary material. The preprocessed databases are available on the web server.