Enhancing Carbon Acid pKa Prediction by Augmentation of Sparse Experimental Datasets with Accurate AIBL (QM) Derived Values

The prediction of the aqueous pKa of carbon acids by Quantitative Structure-Property Relationship (QSPR) or cheminformatics-based methods is a difficult problem. The main obstacle is that there are too few high-quality experimental data points measured under homogeneous conditions to allow a good global model to be generated. In our computationally efficient pKa prediction method, we generate an atom-type feature vector, called a distance spectrum, from the assigned ionisation atom, and learn a coefficient for each atom-type that reflects the impact that atom-type has on the pKa of the ionisable centre. In the current work, we augment our dataset with pKa values from a series of high-performing local models derived from the Ab Initio Bond Lengths (AIBL) method. We find that, by distilling the knowledge available from multiple models into one general model, the prediction error for an external test set is reduced compared to that obtained using literature experimental data alone.


Damped Averaging
For each compound, all of the pKa values gathered from the literature were sorted from the lowest to the highest value; the first value was taken and added to a growing array. The next value was examined and compared to the average of the array. If it differed by more than 2 pKa units, the average of the array was added to the list of reduced values and the new value formed the start of a new array. Otherwise, it was added to the array and the next value was compared. Once all values had been examined, the average of the final array was also added to the list of reduced values. In this fashion the multiple different values were condensed into exemplar values. For example, the list (0.1, 0.2, 0.3, 2.5, 2.6, 3.0) would be condensed down into (0.2, 2.7), thereby allowing for minor experimental disagreement and using the average values for training purposes. This method has the potential to merge genuinely different pKa values if they lie within 2 units of each other. However, this condition is unlikely except in the case of certain zwitterionic compounds, and in those cases it is an acceptable compromise to assign one pKa value to the entire molecule to learn from.
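The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work; the function name `damped_average` and the `threshold` parameter are assumptions introduced here, with the threshold defaulting to the 2 pKa units stated in the text.

```python
def damped_average(values, threshold=2.0):
    """Condense a list of pKa values into exemplar (averaged) values.

    Hypothetical helper: the name and signature are illustrative only.
    """
    reduced = []   # exemplar values found so far
    cluster = []   # the "growing array" of values currently being averaged
    for value in sorted(values):
        # Compare the next value to the running average of the current array.
        if cluster and abs(value - sum(cluster) / len(cluster)) > threshold:
            reduced.append(sum(cluster) / len(cluster))
            cluster = [value]          # the new value starts a new array
        else:
            cluster.append(value)
    if cluster:
        # Flush the average of the final array.
        reduced.append(sum(cluster) / len(cluster))
    return reduced


exemplars = damped_average([0.1, 0.2, 0.3, 2.5, 2.6, 3.0])
print([round(v, 2) for v in exemplars])  # [0.2, 2.7]
```

Because the comparison is made against the running average rather than the previous value, a slowly drifting series of values is split once it moves more than the threshold away from the mean of its cluster.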

Atom-Typer
The atom-typer generates an integer for each atom that encodes four pieces of information:
A. The formal charge of the atom.
B. The atomic number of the atom.
C. The number of non-hydrogen atoms covalently bound to the atom (i.e. a simple measure of sterics).
D. Information about the hybridization and local connectivity.
Overall, the integer takes the form ABBCDD, where:
A = formal charge + 1.
BB = atomic number (e.g. 01 = hydrogen; 06 = carbon; 09 = fluorine).
C = integer count of non-hydrogen atoms covalently bound to the atom.
DD = hybridization value, which depends on the element being typed.
The formal charge portion is the charge + 1 because the formal charge is unlikely to fall outside the range of ±1, so this offset simply keeps the numbers positive. The formal charge, atomic number and non-hydrogen bonding count are essentially self-explanatory, but the hybridization value requires further elaboration. Each of the commonly observed elements in organic chemistry space has its own set of hybridization values. These are an attempt to classify the local environment of the atoms and thereby account for various electronic effects. For example, an sp3 carbon has two types, 02 and 03, depending on whether or not a highly polar group (Z = O, N, P or S) is attached to it. For the sp2 carbon types there are two major groups: aromatic and non-aromatic. Both groups are further subdivided into four specific bonding patterns with strongly polar atoms. The situation is similar for nitrogen, with the major addition that an aniline-like moiety (01) and an amide-like moiety (02) each receive their own DD value, in order to reflect the specific electronic states of those two atom-types. Oxygens are separated so that carbonyl types are distinguished when they are part of amides, carboxylic acids, or thio-acids, as each of those carbonyl groups exerts a different effect on a molecule. Fluorine has special cases as well, mostly to account for CF3 groups exerting a non-linear effect with the elimination of each fluorine. If an element is not present in Figure S1, then its default hybridization value is simply the sum of polar bonds to this element, with an additional 10 if the atom is considered aromatic.
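As an illustration of the ABBCDD layout, the sketch below packs and unpacks the four fields. It is not the atom-typer itself: the function names are hypothetical, the range checks are assumptions inferred from the digit widths, and `default_hybridization` reads "the sum of polar bonds" as a simple count of polar bonds, which is also an assumption. The element-specific DD values of Figure S1 are not reproduced.

```python
def encode_atom_type(formal_charge, atomic_number, heavy_neighbours, hybridization):
    """Pack the four fields into an ABBCDD integer (illustrative sketch)."""
    a = formal_charge + 1                      # A = formal charge + 1
    # Assumed bounds, taken from the number of digits available per field.
    if not 0 <= a <= 9:
        raise ValueError("formal charge does not fit in one digit")
    if not 1 <= atomic_number <= 99:
        raise ValueError("atomic number does not fit in two digits")
    if not 0 <= heavy_neighbours <= 9:
        raise ValueError("neighbour count does not fit in one digit")
    if not 0 <= hybridization <= 99:
        raise ValueError("hybridization value does not fit in two digits")
    return a * 100000 + atomic_number * 1000 + heavy_neighbours * 100 + hybridization


def decode_atom_type(code):
    """Return (formal_charge, atomic_number, heavy_neighbours, hybridization)."""
    a, rest = divmod(code, 100000)
    bb, rest = divmod(rest, 1000)
    c, dd = divmod(rest, 100)
    return a - 1, bb, c, dd


def format_atom_type(code):
    """Zero-pad to six digits so a leading A = 0 remains visible."""
    return f"{code:06d}"


def default_hybridization(polar_bond_count, is_aromatic):
    """Fallback DD value for elements without dedicated rules (assumed reading)."""
    return polar_bond_count + (10 if is_aromatic else 0)


# An oxygen (08) with formal charge -1, one heavy neighbour and DD = 04.
print(format_atom_type(encode_atom_type(-1, 8, 1, 4)))  # 008104
# The same oxygen type with no formal charge.
print(format_atom_type(encode_atom_type(0, 8, 1, 4)))   # 108104
```

Storing the code as an integer drops the leading zero of a negatively charged atom, so the six-digit string form is the safer choice when the code is used as a label.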

QR Coefficient Improvement
When the data are limited, a solution can be found in which two coefficients are very large and of opposite sign. This occurs when two atom-types are always present together and never encountered individually in a molecule. An example of this situation is the pair of oxygen atoms in a nitro group, where one oxygen carries a formal charge and the other does not, giving the atom-type codes 008104 and 108104. From Table S1 it is evident that a sub-optimal solution has been found for the initial model (Start). Such paired coefficients are not a problem as long as the atom-types are always encountered together. However, if one atom-type of the pair is encountered alone, as is the case for an N-oxide moiety, a spurious pKa value (of the order of 10^14) is generated, which is a nonsense prediction. Excluding the coefficients for the nitro oxygens, the standard deviation of the remaining coefficients decreases from 91.8 to 11.0 after the addition of virtual data. The QR decomposition is therefore locating a more stable solution, without large antagonistic coefficients. This means that the predictions will be more robust because, instead of pairs of coefficients having to be encountered in concert to generate a good prediction, each coefficient better reflects its own impact on the pKa of the molecule.
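The effect of such antagonistic pairs can be shown with a toy calculation. The atom-type codes below are those quoted in the text; everything else (the coefficient values, the distance-spectrum values, and the assumption that both nitro oxygens receive the same spectrum value) is invented purely for illustration and does not come from Table S1.

```python
CHARGED_O = "008104"   # charged nitro oxygen (code from the text)
NEUTRAL_O = "108104"   # neutral nitro oxygen (code from the text)


def contribution(coefficients, spectrum):
    """Sum coefficient * distance-spectrum value over the atom-types present."""
    return sum(coefficients[atom_type] * value
               for atom_type, value in spectrum.items())


# Two hypothetical coefficient sets with the same sum for the pair.
balanced = {CHARGED_O: 0.5, NEUTRAL_O: 0.5}
antagonistic = {CHARGED_O: 1000.0, NEUTRAL_O: -999.0}

# Assumed spectra: in a nitro group both oxygens appear with the same value,
# whereas an N-oxide contributes the charged oxygen without its partner.
nitro = {CHARGED_O: 0.25, NEUTRAL_O: 0.25}
n_oxide = {CHARGED_O: 0.25}

for name, coeffs in (("balanced", balanced), ("antagonistic", antagonistic)):
    print(name, contribution(coeffs, nitro), contribution(coeffs, n_oxide))
# balanced 0.25 0.125
# antagonistic 0.25 250.0
```

Both coefficient sets reproduce the nitro contribution exactly, so training data containing only nitro groups cannot distinguish between them. Only a molecule in which one atom-type occurs without its partner exposes the difference.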
There are mathematical approaches to guide the Lhasa methodology towards finding the optimal solution, one of which was used in our log P work (reference 5 in main document). In that case every atom-type had 0.01 added to its occurrence in the matrix, turning the sparse matrix into one with a background value of 0.01. This modification resulted in an improved solution for the log P work, and was applicable there because the other entries were integers with a minimum value of 1, so 0.01 represents only 1% of the smallest value. However, this approach is unsuitable for the current pKa work because the values present in the distance spectrum range from 1 down to 1/64 (0.015625) per atom. An addition of 0.01 would therefore be a significant change, amounting to 64% of the smallest value present. Consequently, the only way out of this conundrum is to obtain more data, especially for certain atom-types, which would eliminate the previously described pairs.

Table S1. Coefficients from the solved model.

Code    Start    1 a    2    3