Enhanced Hyperbox Classifier Model for Nanomaterial Discovery

Machine learning tools can be applied to peptide-mediated biomineralization, which is an emerging biomimetic technique of creating functional nanomaterials. In particular, they can be used for the discovery of biomineralization peptides, which currently relies on combinatorial enumeration approaches. In this work, an enhanced hyperbox classifier is developed which can predict if a given peptide sequence has a strong or weak binding affinity towards a gold surface. A mixed-integer linear program is formulated to generate the rule-based classification model. The classifier is optimized to account for false positives and false negatives, and clearly articulates how the classification decision is made. This feature makes the decision-making process transparent, and the results easy to interpret for decision support. The method developed can help accelerate the discovery of more biomineralization peptide sequences, which may expand the utility of peptide-mediated biomineralization as a means for nanomaterial synthesis.


Introduction
Peptide-mediated biomineralization is a promising bio-inspired technique for creating inorganic nanomaterials that possess functional properties [1]. This biomimetic technique has produced nanomaterials for a wide array of applications, such as highly efficient catalysts [2], plasmonic materials with tailorable optical properties [3], and bimetallic nanoparticles for electrooxidation [4], among others. These interesting materials were produced due to the ability of the biomineralization peptide to regulate the size, shape, and morphology of the nanomaterial. This level of control is achieved due to the binding affinity of the biomineralization peptide towards a particular material surface, which leads to the selective binding of the biomineralization peptide at specific faces of the growing nanomaterial [5,6]. The binding affinity of the peptide is known to be affected by peptide properties such as the oligomerization state [7], conformation [8], and sequence [9].
In order to broaden the utility of peptide-mediated biomineralization as an effective platform for nanomaterial synthesis, a method for quickly identifying biomineralization peptide sequences should be developed. Currently, the discovery of biomineralization peptides depends mainly on tedious laboratory assays that are combinatorial in nature. Since the acquisition of experimental data is costly, typically only small data sets are available in this domain [10]. Thus, the use of modern computational tools such as artificial intelligence (AI), data mining, and machine learning (ML) presents significant potential to address this problem since these tools have been used for

Methodology
The hyperbox model developed here is an extension of the work of Xu and Papageorgiou [25]. The original approach they developed used a mixed integer linear programming (MILP) model to generate non-overlapping hyperboxes to enclose clusters of data; additional hyperboxes can be determined iteratively to improve classification performance. Further algorithmic improvements were reported by Maskooki [26] and Yang et al. [27]. The presence of user-defined parameters makes the training of hyperbox models highly interactive. Expert inputs can be integrated into the procedure to augment small data sets with mechanistic domain knowledge. The resulting optimized hyperboxes constitute a rule-based model that can be used as a classification model. One of the key advantages of this technique is that the hyperboxes can be readily interpreted as IF-THEN rules, which can be used effectively and intuitively to support decisions [27].
The main features of the model extension developed here are as follows:
• The model is a binary classifier such that a pre-defined number of hyperboxes are meant to enclose samples which positively belong to the group and to exclude samples which do not. The initial default assumption is that of negative classification, but the activation of at least one rule results in a positive classification. This approach eliminates the need for negative classification rules. The number of hyperboxes is user-defined and the training is done interactively with expert knowledge inputs to augment small data sets that are typical in this domain [19,20]. This consideration makes it possible to identify alternative rule sets which may make more sense to the expert or may improve the performance of the model. However, a balance should be made between generalizability and over-fitting.
• A user-defined margin is utilized as the minimum distance needed to separate the negative samples from the boundaries of the resulting hyperboxes. Thus, each hyperbox actually consists of concentric inner and outer hyperboxes separated by the said margin. This feature serves the same purpose as the gap between parallel hyperplanes in SVMs.
• The model accounts for Type I (number of false positives) and Type II (number of false negatives) errors such that a user may select which type of error should be minimized while indicating an upper limit for the other. This feature was introduced in an extension of the hyperbox model by Yang et al. [27].
• The model is meant to define the dimensions of the hyperboxes to adequately classify the training data. Reduction of attributes can result in more parsimonious ML models and is thus regarded as an important feature during training [28]. These dimensions may extend to infinity, rendering some attributes to be insignificant for classification. This feature is achieved by enabling the model to remove the lower bound and/or upper bound of each hyperbox along each dimension, as needed.
The overall objective of the model is to minimize Type I errors, α, while ensuring that Type II errors, β, are kept below a defined threshold ε; that is, the objective function (1) is min α. Type I (α) and Type II (β) errors are then defined where j represents a sample in the training set $S_T$. Each sample j in the training data has a performance level in attribute i with the value of $X_{ji}$. The lower ($x^{L}_{ik}$) and upper ($x^{U}_{ik}$) limits along dimension i for hyperbox k are determined to enclose as many positive samples as possible. The dimensions of the outer hyperbox and the dimensions of the inner hyperbox are separated by the user-defined margin ∆.
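The defining equations for α and β did not survive in this copy of the text. The sketch below assumes that each error is normalized by the size of the relevant class (false positives over actual negatives, false negatives over actual positives), an assumption that is consistent with the values reported in the case study (e.g., two false positives among eight weak binders giving α = 0.25). The function name `error_rates` and the 1/0 label encoding are illustrative choices, not part of the original model.

```python
def error_rates(actual, predicted):
    """Return (alpha, beta) for binary labels, 1 = strong binding, 0 = weak.

    Assumed definitions (not taken verbatim from the paper):
      alpha = false positives / actual negatives  (Type I)
      beta  = false negatives / actual positives  (Type II)
    """
    pairs = list(zip(actual, predicted))
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    n_neg = sum(1 for a in actual if a == 0)
    n_pos = sum(1 for a in actual if a == 1)
    alpha = false_pos / n_neg if n_neg else 0.0
    beta = false_neg / n_pos if n_pos else 0.0
    return alpha, beta


# Hypothetical labels: 3 strong binders, 8 weak binders,
# with two weak binders wrongly predicted as strong.
actual = [1] * 3 + [0] * 8
predicted = [1] * 3 + [1] * 2 + [0] * 6
alpha, beta = error_rates(actual, predicted)
```

In the MILP itself these quantities are decision-dependent expressions rather than post-hoc counts; the function above only illustrates how a trained rule set could be scored.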
The possibility of semi-infinite (i.e., having no lower limit or no upper limit) or infinite dimensions (i.e., having no lower and upper limits) for the hyperbox are also considered. In the latter case, the absence of lower and upper limits allows the hyperbox to be projected to a lower dimensional space, and removes the corresponding attribute from the associated decision rule.
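To illustrate how semi-infinite and infinite dimensions simplify the decision rule, the following minimal sketch turns a set of hyperbox limits into an IF-THEN statement. Representing a missing limit with `None`, the function name `box_to_rule`, and the numerical limits in the example are all hypothetical; they are not the values of any table in this work.

```python
def box_to_rule(bounds, label="Strong"):
    """Render one hyperbox as an IF-THEN rule.

    bounds maps an attribute name to (lower, upper); None means "no limit".
    Attributes with neither limit are dropped from the rule entirely.
    """
    conditions = []
    for name, (lower, upper) in bounds.items():
        if lower is None and upper is None:
            continue  # infinite dimension: attribute is irrelevant
        if lower is not None and upper is not None:
            conditions.append(f"{lower} <= {name} <= {upper}")
        elif lower is not None:
            conditions.append(f"{name} >= {lower}")  # no upper limit
        else:
            conditions.append(f"{name} <= {upper}")  # no lower limit
    antecedent = " AND ".join(conditions) if conditions else "TRUE"
    return f"IF {antecedent} THEN binding is {label}"


# Hypothetical limits for three attributes.
example_bounds = {
    "K1": (None, None),   # removed from the rule
    "K3": (0.2, None),    # one-sided limit
    "K10": (0.1, 0.9),    # two-sided limit
}
rule = box_to_rule(example_bounds)
```

Dropping unbounded attributes from the antecedent is what makes the resulting rule shorter and easier for an expert to inspect.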
For instances in which sample j lies outside the boundaries of the hyperbox, $X_{ji}$ may lie below the lower limit of hyperbox k in dimension i by more than the user-defined margin ∆ (11), or $X_{ji}$ may lie above the upper limit of hyperbox k in dimension i by more than the user-defined margin ∆ (12). If the sample satisfies either of the two conditions, then the sample is outside the hyperbox and $q^{L}_{ijk}$ or $q^{U}_{ijk}$ will be equal to 1. Consequently, in such instances, the sample should be considered a negative sample and $b_{jk}$ = 0.
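The role of the margin can be sketched outside the MILP as a simple geometric check. The snippet below assumes that the stated limits describe the inner hyperbox and that the outer hyperbox is obtained by extending each finite limit by ∆; this reading, the function name `position`, and the three-way result are illustrative assumptions rather than the constraints (11) and (12) themselves.

```python
def position(sample, box, delta):
    """Locate a sample relative to one hyperbox with margin delta.

    box is a list of (lower, upper) pairs, one per dimension; None = no limit.
    Returns "inside"  if every coordinate is within the limits,
            "outside" if some coordinate exceeds a limit by more than delta,
            "margin"  otherwise (between the inner and outer boxes).
    """
    inside = True
    for value, (lower, upper) in zip(sample, box):
        if lower is not None:
            if value < lower - delta:
                return "outside"  # analogous to q^L = 1
            if value < lower:
                inside = False
        if upper is not None:
            if value > upper + delta:
                return "outside"  # analogous to q^U = 1
            if value > upper:
                inside = False
    return "inside" if inside else "margin"


# Hypothetical two-dimensional box; the second dimension has no lower limit.
box = [(0.2, 0.8), (None, 0.5)]
```

Under this reading, a well-trained box leaves no training samples in the "margin" zone: positives fall inside and negatives fall outside.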
In instances in which there are multiple hyperboxes, a sample is said to be a positive sample, $c_j$ = 1, if it is enclosed by at least one hyperbox. Thus, each hyperbox corresponds to one rule, and the resulting classifier consists of a set of disjunctive rules. The relationship among a set of such rules can be represented by a Venn diagram as shown in Figure 1. Samples that do not activate any of the rules (i.e., are not enclosed by any of the hyperboxes) are classified as negative by default.
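The disjunctive logic described above can be sketched as follows: a sample is positive if any hyperbox encloses it and negative by default otherwise. The helper names `in_box` and `classify`, the use of `None` for a missing limit, and the two example boxes are hypothetical.

```python
def in_box(sample, box):
    """True if the sample satisfies every finite limit of the box."""
    return all(
        (lower is None or value >= lower) and (upper is None or value <= upper)
        for value, (lower, upper) in zip(sample, box)
    )


def classify(sample, boxes):
    """Return 1 (positive) if at least one box encloses the sample, else 0."""
    return 1 if any(in_box(sample, box) for box in boxes) else 0


# Two hypothetical boxes over two attributes; None means no limit.
boxes = [
    [(0.0, 0.3), (None, None)],  # rule 1 ignores the second attribute
    [(0.6, None), (0.5, 1.0)],   # rule 2 has no upper limit on the first
]
```

Because the default class is negative, no separate rules are needed for the weak-binding class.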
AI 2020, 1, FOR PEER REVIEW 4
Finally, the binary variables are as follows:
This MILP model can be solved to global optimality using the branch-and-bound algorithm, which is available as a standard feature in many commercial optimization software packages. Alternative solutions (rule sets) can be determined for any given MILP using standard integer-cut features in such software; additional solutions can also be found by adjusting the model parameters. Figure 2 shows the interactive training procedure for the hyperbox classifier.
The case study utilizes data from the work of Janairo [23], which looks into the classification of the biomineralization peptide binding affinity using SVM. There are 31 cases, each representing a biomineralization peptide sequence. These biomineralization peptide sequences are characterized by 10 parameters, called the Kidera factors, and the peptides are classified into two categories of either weak or strong binding affinity. Kidera factors are descriptors related to the structure and physicochemical properties of proteins derived from rigorous statistical analyses [29]. The 10 Kidera factors are K1: helix/bend preference, K2: side-chain size, K3: extended structure preference, K4: hydrophobicity, K5: double-bend preference, K6: partial specific volume, K7: flat extended preference, K8: occurrence in alpha region, K9: pK-C, K10: surrounding hydrophobicity. The datapoints were randomized and further divided into two sets with 20 datapoints used for training and the remaining 11 used for validation. The data were initially normalized to transform $X_{ji}$ to $X^{*}_{ji}$, where $x^{MIN}_{ji}$ is the lowest value in dimension i among all samples j while $x^{MAX}_{ji}$ is the largest value observed in dimension i among all samples j. The raw data for all 31 datapoints are included in Table S1 of the supplementary material.
The training was done by solving (1) subject to the constraints defined in (2) to (17) with ∆ = 0.05, $Z^{L}_{ik}$ = −50.00, and $Z^{U}_{ik}$ = 50.00. The model was implemented in LINGO 18.0 and solved using a laptop with an Intel® Core™ i7-6500U 2.5 GHz CPU with 8.0 GB RAM.
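The normalization equation itself is not legible in this copy of the text. Since the description refers only to the lowest and highest values in each dimension, the sketch below assumes standard min-max scaling, $X^{*}_{ji} = (X_{ji} - x^{MIN}) / (x^{MAX} - x^{MIN})$, applied per attribute. The function name and the toy data are hypothetical, and mapping a constant column to 0.0 is an added convention.

```python
def min_max_normalize(rows):
    """Scale each column of a list of equal-length rows to the range [0, 1].

    Assumes min-max scaling per attribute; a constant column maps to 0.0.
    """
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, mins, maxs)
        ]
        for row in rows
    ]


# Toy data: three samples, two attributes (not values from Table S1).
raw = [[1.0, 10.0], [3.0, 20.0], [2.0, 30.0]]
scaled = min_max_normalize(raw)
```

With attributes on a common scale, a single margin ∆ has a comparable meaning along every dimension.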

Case Study
Using just one hyperbox, with ε < 0 and a constraint that at least one attribute can be removed, the optimal solution was obtained in 0.20 s. The optimal solution correctly classified all training data such that α = β = 0. The resulting dimensions for the 10 attributes considered are shown in Table 1, where shaded entries indicate that there is no lower or upper limit for the corresponding attribute. Table 1 can be translated into IF-THEN rules as follows. Only five attributes remained relevant (K3, K6, K7, K9, and K10), with K3, K6, K7, and K9 having one-sided limits and K10 being the only attribute bounded by both an upper and a lower limit.  Using this rule to classify the validation data resulted in all eight weak binding samples (of the 11 validation samples) and all three strong binding samples being correctly classified. The confusion matrix for this optimal rule is summarized in Table 2. The previous SVM classifier also found K3, K6, K7, and K9 to be significant descriptors for the prediction of the biomineralization peptide binding affinity class [23]. However, the SVM classifier arrived at this model after 12 optimization steps, as opposed to the quick and straightforward manner of the present hyperbox model. Moreover, using the same set of descriptors, the hyperbox model outperforms the SVM classifier, whose prediction accuracy was 92%.

N = 11             Predicted Positive   Predicted Negative
Actual Positive    3                    0
Actual Negative    0                    8
α = 0.0, β = 0.0

It is also possible to identify alternative sets of rules from degenerate solutions or near-optimal solutions. An example is given in Table 3, which corresponds to the 10th solution for the problem considered. The rule generated is relatively more complex than Rule 1 because seven attributes (i.e., K2, K5, K6, K7, K8, K9, and K10) are needed to predict peptide binding affinity.  This rule was able to classify the training data correctly with α and β still equal to 0. Similarly, it was effective in classifying all 11 samples in the validation data as summarized in Table 4.
The model is then extended to consider five hyperboxes with ε = 0.30. The consideration of additional hyperboxes enables the possibility of identifying alternative rule sets which can be used to classify data and potentially improve the performance of the model. Five hyperboxes, for example, translate to having five different rule sets to classify objects. Additional constraints to limit attribute overlaps between the boxes have been added as indicated in (19) to (21).
Optimizing with $n_A$ = 3 results in the boundaries summarized in Table 5. The optimal solution was obtained in 5 s computational time and was able to classify seven out of nine positive samples and 11 out of 11 negative samples from the training data. The rules are disjunctive, and can be summarized as follows:  If a sample meets at least one of these rules, then the sample can be considered to have Strong binding. The rules were then used to evaluate the validation data; their performance is summarized in Table 6.

Table 6. Confusion matrix of Hyperbox Decision Model 3.
N = 11             Predicted Positive   Predicted Negative
Actual Positive    3                    0
Actual Negative    2                    6
α = 0.25, β = 0.0

Again, an alternative set of rules (fifth near-optimal solution) is explored and the results are as shown in Table 7. These can be translated into the following:  This alternative solution was able to classify eight out of nine positive samples and 11 out of 11 negative samples, resulting in α = 0 and β = 0.1111 for the training data set. The rules were then used for the validation data set and the confusion matrix is shown in Table 8.

N = 11             Predicted Positive   Predicted Negative
Actual Positive    3                    0
Actual Negative

The performance of the algorithm was further tested by performing the procedure five more times using a different sampling of training and validation data each time. The k-fold validation was completed in 1 min and 38 s. The performance of the rules for training and validation data is summarized in Table 9.
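The repeated resampling described above can be sketched as a series of seeded random splits of the 31 samples into 20 training and 11 validation indices. The text does not state how the resampling was implemented, so the function name `random_split`, the use of Python's `random` module, and the seed values are assumptions made only for illustration.

```python
import random


def random_split(n_samples, n_train, seed):
    """Return (train_indices, validation_indices) for one random split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    return indices[:n_train], indices[n_train:]


# Five hypothetical resampling rounds with 20 training and 11 validation samples.
splits = [random_split(31, 20, seed) for seed in range(5)]
```

Each split would then be passed through the same training and evaluation procedure, and the resulting error rates compared across rounds.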
The variables that the enhanced hyperbox algorithm automatically determined to be significant in making the biomineralization peptide binding class prediction were K3 (extended structure preference), K6 (partial specific volume), K7 (flat extended preference), K9 (pK-C), and K10 (surrounding hydrophobicity). The selection of these variables affirms and reinforces the findings of past studies which systematically analyzed the factors that govern peptide binding to surfaces. The inclusion of the variables that relate to the peptide conformation (K3 and K7) and protonation state (K9) is consistent with the findings of Hughes et al., who concluded that peptide conformation is a major feature that influences the size, shape, and stability of peptide-capped materials [30]. In addition, the incorporation of peptide variables related to water interaction, such as K6 and K10, likewise supports the results of the atomistic simulations of Verde et al., who reported how peptide solvation influences structural flexibility and surface adsorption [31]. Thus, the presented enhanced hyperbox model has formalized these associations into a concise classifier, with a clearly articulated set of rules. Aside from transparency on how the decision was reached through rule generation, another major advantage of the enhanced hyperbox model is its accuracy, as shown in Table 10. The present model outperforms common machine learning algorithms, which were simulated in R [32], in terms of accuracy, sensitivity, and specificity.
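For reference, the three comparison measures named above can be computed from a confusion matrix with their standard definitions. The sketch below uses the counts reported for the validation data in Table 6 as an example; the function name `metrics` is hypothetical, and the code assumes that both classes are present so that no denominator is zero.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Counts from Table 6: 3 true positives, 0 false negatives,
# 2 false positives, 6 true negatives.
table6 = metrics(tp=3, fn=0, fp=2, tn=6)
```

With class-normalized error rates, specificity equals 1 − α and sensitivity equals 1 − β, which matches the α = 0.25 and β = 0.0 reported with that table.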

Conclusions
In this work, we developed a hyperbox-based ML technique that can accurately predict the binding affinity class of a biomineralization peptide based on its sequence. The rule-based model offers a quick and straightforward model-building capability which does not compromise prediction accuracy. Interactive training via an MILP model allows the hyperbox technique to combine expert knowledge with information drawn from small data sets that are typical in materials science applications. The model also features a clear and transparent set of rules on which the predictions are based, making the classification tasks reproducible and the process interpretable. The presented model is a valuable addition to machine learning tools, which are becoming pivotal components in materials discovery and development. In particular, the presented model can accelerate the discovery of biomineralization peptides while minimizing trial-and-error, thereby reducing cost. The use of this ML technique for other problems in materials science and nanotechnology should thus be explored further. Future work can focus on expanding the model to automatically adjust the margin between the concentric boxes, establishing heuristics for defining the number of hyperboxes, and analyzing how variations in these parameters can potentially influence the performance of the model.