# Fast and Accurate Prediction of Refractive Index of Organic Liquids with Graph Machines


## Abstract


## 1. Introduction

… $n_1$ and $n_2$ allows one to calculate the relationship between the angles of incidence $\theta_1$ and refraction $\theta_2$ according to the so-called Snell–Descartes law (Equation (1)).

$$n_1 \sin\theta_1 = n_2 \sin\theta_2 \qquad (1)$$

… $\theta_1$, knowledge of $n_1$ and $n_2$ allows predicting $\theta_2$. In particular, when medium 1 is air, whose refractive index is very close to 1, the accurate measurement of $\theta_2$ provides the refractive index $n_2$ of a liquid by applying Equation (1). Given the ease of such measurements, the refractive indices of thousands of liquids are known with high accuracy (≈$10^{-3}$ with standard refractometers) and are readily available in the literature.
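As a minimal sketch of how Equation (1) is used in practice, the following Python snippet solves the Snell–Descartes law either for the refraction angle or for the unknown index $n_2$. The function names and the example angle are illustrative assumptions, not part of the original work.

```python
import math


def refraction_angle(theta1_deg, n1, n2):
    """Return the refraction angle (degrees) from n1*sin(theta1) = n2*sin(theta2)."""
    s = n1 * math.sin(math.radians(theta1_deg)) / n2
    if abs(s) > 1.0:
        # No refracted ray exists (total internal reflection).
        raise ValueError("no refracted ray for these indices and angle")
    return math.degrees(math.asin(s))


def index_from_angles(theta1_deg, theta2_deg, n1=1.0):
    """Return n2 from measured incidence and refraction angles (medium 1 is air by default)."""
    return n1 * math.sin(math.radians(theta1_deg)) / math.sin(math.radians(theta2_deg))


# Round trip with water (n = 1.333) and an arbitrary 45 degree incidence angle.
theta2 = refraction_angle(45.0, 1.0, 1.333)
n2_recovered = index_from_angles(45.0, theta2)
```

Because the two functions are inverses of each other, the round trip recovers the index used to generate the refraction angle.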

… $n_1$ and $n_2$ coincide. Refraction and reflection phenomena are eliminated, and the light passes through the heterogeneous material as if it were isotropic. For example, to make a transparent toothpaste, the refractive index of the abrasive particles (SiO$_2$, x H$_2$O) must be equal to the refractive index of the water-based toothpaste matrix. As the refractive index of hydrated silica (n ≈ 1.44) is significantly higher than that of water (n = 1.333), it is necessary to add to the aqueous phase a very precise amount of an edible liquid with a higher refractive index, such as sorbitol (n = 1.525), in order to match the refractive indices of the particles and the matrix [1].

… $n_1$ and $n_2$ must be maximized because, for “natural light” (i.e., unpolarized light), the intensity of specular reflection increases with the difference ($n_1 - n_2$) according to Fresnel’s law of reflection and Schlick’s approximate equation (Equations (2) and (3)):

where $R_0$ is the reflection coefficient for light incoming perpendicular to the interface between the two media 1 and 2 and $\theta_1$ is the angle of incidence.
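Equations (2) and (3) are not reproduced in this text. The sketch below assumes their standard textbook forms: the Fresnel reflectance at normal incidence, $R_0 = \left(\frac{n_1 - n_2}{n_1 + n_2}\right)^2$, and Schlick's approximation, $R(\theta_1) = R_0 + (1 - R_0)(1 - \cos\theta_1)^5$. The function names are illustrative, and the indices are the values quoted in the text for rutile, air, and the organic matrix.

```python
import math


def normal_reflectance(n1, n2):
    """Fresnel reflectance at normal incidence (assumed form of Equation (2))."""
    return ((n1 - n2) / (n1 + n2)) ** 2


def schlick_reflectance(theta1_deg, n1, n2):
    """Schlick's approximation for unpolarized light (assumed form of Equation (3))."""
    r0 = normal_reflectance(n1, n2)
    return r0 + (1.0 - r0) * (1.0 - math.cos(math.radians(theta1_deg))) ** 5


# Organic matrix (n ~ 1.48) against rutile (n = 2.75) and against air (n = 1.00).
r0_rutile = normal_reflectance(1.48, 2.75)
r0_air = normal_reflectance(1.48, 1.00)
```

Under these assumptions, the larger index contrast of rutile gives a higher normal-incidence reflectance than an air bubble in the same matrix, which is consistent with the ranking of the two whitening strategies discussed in the text.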

… TiO$_2$ (rutile form), whose refractive index (n = 2.75) is much higher than that of the organic matrix (n ≈ 1.48). The second method consists in formulating the liquid paint beyond the CPVC (critical pigment volume concentration) so that, after drying, some microbubbles of air remain inside the film, whose refractive index (n = 1.00) is significantly lower than that of the surrounding matrix. The latter strategy is cheaper than the former, but the resulting coating is porous and has low mechanical strength. It is nevertheless suitable for whitening ceilings where the coatings are not subject to mechanical stress [2].

… α (C·m$^{2}$·V$^{-1}$) expresses the tendency of the molecule to acquire an electric dipole moment when subjected to an electric field, $\varepsilon_0$ (F·m$^{-1}$ or C·m$^{-1}$·V$^{-1}$) is the vacuum permittivity, and ${V}_{m}$ (m$^{3}$) is the molar volume. However, in the literature, the Lorentz-Lorenz relationship is more frequently expressed using the polarizability volume α’ = α/(4π$\varepsilon_0$) instead of the polarizability α, which leads to Equation (5):

… $V_m$ are expressed in the same unit (m$^{3}$) and have the same order of magnitude.
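The Lorentz-Lorenz equations themselves are missing from this text. As a hedged illustration, the snippet below assumes the standard form $\frac{n^2 - 1}{n^2 + 2} = \frac{4\pi}{3}\,\frac{\alpha'}{V_m}$, with $V_m$ taken per molecule so that the ratio $\alpha'/V_m$ is dimensionless (as the common unit m$^3$ suggests). The function names are hypothetical.

```python
import math


def index_from_ratio(alpha_over_v):
    """Refractive index from the dimensionless ratio alpha'/V_m (assumed Lorentz-Lorenz form)."""
    x = 4.0 * math.pi * alpha_over_v / 3.0
    if not 0.0 <= x < 1.0:
        raise ValueError("ratio outside the physically meaningful range")
    return math.sqrt((1.0 + 2.0 * x) / (1.0 - x))


def ratio_from_index(n):
    """Inverse relation: alpha'/V_m from a measured refractive index."""
    return 3.0 * (n * n - 1.0) / (4.0 * math.pi * (n * n + 2.0))


# Round trip using the refractive index of water quoted in the text.
ratio_water = ratio_from_index(1.333)
n_water = index_from_ratio(ratio_water)
```

The inverse function makes explicit that, in this form, the refractive index depends only on the ratio $\alpha'/V_m$, which is why additive group contributions to that ratio can be used for prediction.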

… $V_m$ of a compound as a sum of the ratios $\alpha'_i/V_{m,i}$ of all its functional groups $i$ (Equation (7)):

## 2. Results

#### 2.1. Graph Machine Model Selection

#### 2.2. Performance of the Selected Graph Machine-Based Model on TCI Datasets

… (RI$_{est.}$ equal to 1.363). The same computation made without a stereochemical label for the enol form (i.e., SMILES equal to CC(C=C(O[H])C(F)(F)F)=O) leads to a less relevant estimate (RI$_{est.}$ = 1.405). Similarly, the RI computation for the most stable enol form of 2-acetylcyclohexan-1-one, shown at the bottom right of Figure 1, leads to a value of 1.505, more in line with the measured value (RI$_{exp.}$ = 1.510). These predictions, made with enol forms, lead to values very close to those measured, indicating that in this case, it would be preferable to input the SMILES of the enol forms for the GM construction of these molecules. This also highlights a very important property of graph machines: the ability to detect anomalies in the data, either in the measured values or in the input codes.

… RI$_{exp.}$, equal to 1.524 (Figure 2c). This trick forces the algorithm to position the root node of the pyridazine graph on a carbon atom rather than on a nitrogen atom, as for 3-methylpyridazine. For the three examples detailed above, an improvement in the prediction would be obtained by increasing the number of structurally related compounds in the training set. Another approach is described in Section 4, in which all the compounds of the test set are incorporated into an extended training database.

… R$^{2}$ are above 0.99 for both sets. Data points for the five molecules discussed above are also shown, as is that for thieno[2,3-b]thiophene, which shares with pyridazine the largest positive deviation in prediction.

#### 2.3. Head-to-Head Comparison of the Graph Machine-Based Model with Three Models Reported by Other Authors

#### 2.3.1. Rectification of the Datasets

#### 2.3.2. Performance Comparison of the Three Models

## 3. Materials and Methods

#### 3.1. Data Analysis and Curation

#### 3.2. Analysis of Homologous Series

… CH$_2$, but could be any other repeating unit such as CF$_2$ or Si(CH$_3$)$_2$O), it can be shown based on Equation (4) (cf. Supporting Information, Section B) that the refractive index n follows the law given in Equation (10):

… $n^{2}$ for the molecule with no repeating unit (see last two columns of Table 5). We made two different comparisons. The first one only compares molecules with different functional groups, sharing CH$_2$ as the repeating unit. In this case, ${n}_{repeat}^{2}$ is expected to be $n^{2}$ for polyethylene. We thus used the known value ${n}_{repeat}$ = 1.476 [44]. For the five series chosen as an example, Equation (10) is closely followed, with only B and C (listed in Table 5) fitted to experimental data (cf. Figure 6a, data points listed in Table S2 of Supporting Information, Section C). Note that Equation (10) implies that if the refractive index of the polymer composed of repeating units is equal to that of the molecule without repeating units, then the refractive index of the whole homologous series will be constant. This is approximately the case for the α,ω-diaminoalkanes in Figure 6a, since $n_{PE}$ is nearly equal to $n_{hydrazine}$.
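Equation (10) is not reproduced in this text. A form that follows from the Lorentz-Lorenz relation with additive polarizability and volume, and that matches the properties described above ($\sqrt{B/C}$ for the member with no repeating unit, ${n}_{repeat}$ for an infinitely long chain), is $n^2 = \frac{B + n_{repeat}^2\,N}{C + N}$, where $N$ is the number of repeating units. The sketch below assumes this form and assumes that $N$ is counted from the initial member listed in Table 5; the coefficients are the Table 5 values for 1-iodoalkanes and primary alcohols.

```python
import math


def series_index(n_units, b, c, n_repeat):
    """Refractive index of a homologous-series member (assumed form of Equation (10))."""
    return math.sqrt((b + n_repeat ** 2 * n_units) / (c + n_units))


N_REPEAT_PE = 1.469  # Table 5 value for the CH2 repeating unit

# Table 5 coefficients (B, C)
IODOALKANES = (4.837, 2.063)
ALCOHOLS = (6.008, 3.407)

n_iodomethane = series_index(0, *IODOALKANES, N_REPEAT_PE)
n_methanol = series_index(0, *ALCOHOLS, N_REPEAT_PE)
n_long_chain = series_index(1e9, *IODOALKANES, N_REPEAT_PE)
```

With this form, a series whose initial member has a higher index than the repeating unit (1-iodoalkanes) decreases toward ${n}_{repeat}$ as the chain grows, while a series starting below it (primary alcohols) increases toward the same limit.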

… CH$_2$, CF$_2$, and Si(CH$_3$)$_2$O functional groups.

#### 3.3. Graph Machine Modeling

- Construction of the 2D-graph of the molecule from its SMILES representation: each node of the graph is a non-H atom, and each edge of the graph is a chemical bond. Each node has at least two labels: the nature of the atom and its degree (the number of chemical bonds that bind it to its adjacent non-H atoms). For molecules that contain stereochemical information such as E/Z configurations, or wedge bonds, and hence R/S configurations, additional labels that we have named iso and chi are added to the relevant nodes. For molecules that contain cycles and are hence represented by a cyclic graph, one edge is deleted for each cycle of the molecule in order to form an acyclic graph in which every path of the graph ends at a specific node called the root or output node.
- Construction of the computational structure: for each acyclic graph, a function is generated by implementing, at each node of the graph, a parameterized nonlinear function called the node function, typically a multi-layer perceptron (MLP) with tanh activation functions for the hidden neurons and a linear output neuron. Since this construction does not require any descriptor, biases (neurons with non-trainable outputs equal to 1) are used instead of traditional inputs for the MLP, typically one for each label (e.g., Cl-h0, D3-h0, and iso1-h0 in Figure 7). The trick of this construction is to use the same function for all nodes and for all graphs. Therefore, the number of parameters in the resulting model is equal to the number of parameters in the chosen node function. As a result of this construction, the value computed by the output node of each model, which is intended to be an estimate of the refractive index, depends solely on the 2D structure of the molecule and the node function parameter values.
- Estimation of the parameters of the node function by training from the database: this is done by minimizing the sum of squared errors J(**θ**) defined in the next subsection.
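The three steps above can be illustrated with a toy sketch. It is not the authors' implementation: the label vocabulary, the fixed maximum number of children, the scalar node output, and the hand-built graph are all simplifying assumptions, and no SMILES parsing or stereochemical labels are included. It shows only the central idea, namely that one shared node function (an MLP with tanh hidden neurons and a linear output) is applied recursively from the leaves to the root of an acyclic graph.

```python
import numpy as np

# Hypothetical label vocabulary: atom types and degrees (the real label set is larger).
LABELS = ["C", "Cl", "O", "D1", "D2", "D3"]
MAX_CHILDREN = 3  # assumed upper bound on the number of children of a node
N_HIDDEN = 4      # hidden neurons of the shared node function


class Node:
    """One non-H atom of the acyclic, rooted molecular graph."""

    def __init__(self, labels, children=()):
        self.labels = list(labels)
        self.children = list(children)


def init_params(rng):
    """Parameters of the single node function shared by all nodes of all graphs."""
    n_in = 1 + len(LABELS) + MAX_CHILDREN  # constant bias + label biases + child outputs
    return {
        "hidden": rng.normal(0.0, 0.1, size=(N_HIDDEN, n_in)),
        "output": rng.normal(0.0, 0.1, size=N_HIDDEN + 1),
    }


def node_function(params, labels, child_values):
    """MLP with tanh hidden neurons and a linear output neuron."""
    x = np.zeros(1 + len(LABELS) + MAX_CHILDREN)
    x[0] = 1.0  # constant bias
    for label in labels:
        x[1 + LABELS.index(label)] = 1.0  # one bias input per label carried by the node
    start = 1 + len(LABELS)
    for k, value in enumerate(child_values):
        x[start + k] = value  # outputs of the child nodes
    hidden = np.tanh(params["hidden"] @ x)
    return float(params["output"][0] + params["output"][1:] @ hidden)


def graph_machine(params, node):
    """Evaluate the graph recursively; the value at the root is the model output."""
    child_values = [graph_machine(params, child) for child in node.children]
    return node_function(params, node.labels, child_values)


# Cl-C=C-Cl rooted on one carbon atom (stereochemical label omitted in this sketch).
root = Node(["C", "D2"], [Node(["Cl", "D1"]), Node(["C", "D2"], [Node(["Cl", "D1"])])])

params = init_params(np.random.default_rng(0))
estimate = graph_machine(params, root)  # untrained, so the value is meaningless
n_parameters = sum(p.size for p in params.values())
```

Because the same `params` dictionary is used at every node, the number of trainable parameters in this sketch depends only on the node function, not on the size or number of molecules.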

#### 3.4. Model Selection

… $N_T$ elements, by minimizing, using the weight sharing method between all nodes of all graph machines, the sum of squared errors of the cost function J(**θ**) (Equation (11)):

…$_{i}$ is the measured value of the RI for the i-th element of the training set, **θ** is the vector of parameters, and $g_i$(**θ**) is the value of the RI estimated by the graph machine for that element. In this work, $g_i$(**θ**) is constructed as a combination of MLPs with a single hidden layer that reflects the graph structure of the i-th element. Each MLP is a linear combination of nonlinear functions called hidden neurons, which are hyperbolic tangents of a linear combination of the variables. All minimizations of the cost function are performed with the Levenberg–Marquardt algorithm, which is well suited to optimization problems with a moderate number of variables [46].
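Equation (11) is missing from this text; from the definitions above, it is a sum of squared errors of the form $J(\boldsymbol{\theta}) = \sum_{i=1}^{N_T} \left(y_i - g_i(\boldsymbol{\theta})\right)^2$, where the symbol $y_i$ for the measured value is an assumption. The sketch below minimizes such a cost with a bare-bones Levenberg–Marquardt loop on a toy model with a single tanh hidden neuron. The model, the data, and the damping schedule are illustrative assumptions, not the authors' training code.

```python
import numpy as np


def model(theta, x):
    """Toy stand-in for g_i(theta): one tanh hidden neuron plus a linear output."""
    a, b, c = theta
    return a * np.tanh(b * x) + c


def jacobian(theta, x):
    """Derivatives of the model outputs with respect to (a, b, c)."""
    a, b, _ = theta
    t = np.tanh(b * x)
    return np.column_stack([t, a * x * (1.0 - t ** 2), np.ones_like(x)])


def cost(theta, x, y):
    """Sum of squared errors J(theta)."""
    r = y - model(theta, x)
    return float(r @ r)


def levenberg_marquardt(theta, x, y, n_iter=50, lam=1e-2):
    theta = np.asarray(theta, dtype=float)
    current = cost(theta, x, y)
    for _ in range(n_iter):
        r = y - model(theta, x)
        z = jacobian(theta, x)
        # Damped Gauss-Newton step: (Z^T Z + lam I) step = Z^T r
        step = np.linalg.solve(z.T @ z + lam * np.eye(theta.size), z.T @ r)
        candidate = theta + step
        new = cost(candidate, x, y)
        if new < current:   # accept the step and trust the quadratic model more
            theta, current, lam = candidate, new, lam / 10.0
        else:               # reject the step and move toward gradient descent
            lam *= 10.0
    return theta, current


x = np.linspace(-2.0, 2.0, 21)
y = model(np.array([0.5, 1.5, 1.4]), x)  # synthetic, noise-free targets
theta0 = np.array([1.0, 1.0, 0.0])
j_initial = cost(theta0, x, y)
theta_hat, j_final = levenberg_marquardt(theta0, x, y)
```

A step is kept only when it lowers the cost, so `j_final` can never exceed `j_initial`; the damping factor interpolates between a Gauss–Newton step and a short gradient-descent step.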

… $N_T$ graph machines, while the LOO score computation requires the training of 3516 graph machine-based models, each containing $N_T - 1$ graph machines. As explained in a previous paper [30], the VLOO score relies on a first-order approximation of the estimation error that would be obtained on each molecule of the training set if that molecule had been removed from that set before training. Thus, denoting by **θ**$_{m}$ the parameter vector after completion of training, the VLOO score (Equation (12)) is defined as the root-mean-square of the VLOO prediction errors:

…$_{i}$ is the measured value of the RI for the i-th element of the training set. The VLOO score is consequently an estimate of the model’s generalization error. A detailed mathematical analysis of VLOO is provided by Monari and Dreyfus [47].
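Equation (12) is not reproduced in this text. In the leverage-based formulation of Monari and Dreyfus [47], each virtual leave-one-out error is the training residual $r_i$ divided by $1 - h_{ii}$, where $h_{ii}$ is the leverage of example $i$, i.e., the i-th diagonal element of $Z(Z^{T}Z)^{-1}Z^{T}$ with $Z$ the Jacobian of the model outputs with respect to the parameters at **θ**$_{m}$. The sketch below assumes this formulation and checks it on a linear model, for which the formula is exact rather than a first-order approximation. The data and names are illustrative.

```python
import numpy as np


def leverages(z):
    """Diagonal of the hat matrix Z (Z^T Z)^-1 Z^T, computed through a thin QR factorization."""
    q, _ = np.linalg.qr(z)
    return np.sum(q ** 2, axis=1)


def vloo_score(residuals, z):
    """Root-mean-square of the virtual leave-one-out errors r_i / (1 - h_ii)."""
    virtual_errors = residuals / (1.0 - leverages(z))
    return float(np.sqrt(np.mean(virtual_errors ** 2)))


# Synthetic linear example: for a linear model, Z is simply the design matrix.
rng = np.random.default_rng(0)
n_examples = 20
z = np.column_stack([np.ones(n_examples), rng.normal(size=(n_examples, 2))])
y = z @ np.array([1.4, 0.05, -0.02]) + rng.normal(scale=0.01, size=n_examples)

theta, *_ = np.linalg.lstsq(z, y, rcond=None)
score = vloo_score(y - z @ theta, z)

# Brute-force leave-one-out for comparison: refit N times with one example withdrawn.
loo_errors = []
for i in range(n_examples):
    keep = np.arange(n_examples) != i
    theta_i, *_ = np.linalg.lstsq(z[keep], y[keep], rcond=None)
    loo_errors.append(y[i] - z[i] @ theta_i)
true_loo_score = float(np.sqrt(np.mean(np.square(loo_errors))))
```

The virtual score requires a single fit, whereas the brute-force loop refits the model once per example, which is the cost saving described in the text.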

- Launch a large number (e.g., 100) of parameter initializations followed by a full training computation;
- Select a small number (e.g., 10) of results with the smallest VLOO values;
- Use these selected models to compute the average property estimation for a fresh example.
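The selection procedure listed above can be sketched as follows. The representation of a trained model as a dictionary with a `vloo` score and a `predict` callable is a hypothetical convenience for illustration, not the authors' data structure.

```python
from statistics import mean


def ensemble_estimate(trained_models, example, n_keep=10):
    """Average the predictions of the n_keep models with the smallest VLOO scores."""
    ranked = sorted(trained_models, key=lambda m: m["vloo"])
    selected = ranked[:n_keep]
    return mean(m["predict"](example) for m in selected)


# Dummy stand-ins for 100 trainings started from different parameter initializations:
# model k has VLOO score k and always predicts the constant value k.
dummy_models = [
    {"vloo": float(k), "predict": (lambda example, k=k: float(k))}
    for k in range(100)
]
estimate = ensemble_estimate(dummy_models, example=None)
```

With these dummy models, the ten smallest VLOO scores belong to models 0 to 9, so the ensemble estimate is the mean of their constant predictions.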

… (**θ**$_{m}$) are assigned to its node functions so that the graph machine output provides an estimate of the RI for that molecule. The true benefit of this approach is the absence of descriptors; the SMILES codes are the only required information. Moreover, the same set of graph machines can be reused to estimate another property or activity after a re-training of the model. More details on graph machine construction and training can be found in an earlier paper [28].

## 4. Discussion

… V$_{Si42}$, R$_{Si42}$, V$_{P31}$, R$_{P31}$, V$_{P32}$, and R$_{P32}$, are not available. This situation does not arise with graph machines if the atom is already present during training, like Si or P in the cases above. If the atom is absent from the training set, simply retraining the model after adding a few compounds that contain the missing atom and have known refractive indices can then secure the GM24 prediction.

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

…$^{2}$ model implemented in the BIOVIA COSMOtherm software, release 2023. F.D. would like to thank G. Dreyfus for his help with methodology and conceptualization and for his proofreading of the manuscript.

## Conflicts of Interest

## Sample Availability

## References

1. Teoman, B.; Potanin, A.; Armenante, P.M. Optimization of optical transparency of personal care products using the refractive index matching method. Colloids Surf. A Physicochem. Eng. Asp. **2021**, 610, 125595.
2. Patton, T.C. Paint Flow and Pigment Dispersion: A Rheological Approach to Coating and Ink Technology, 2nd ed.; Wiley: Hoboken, NJ, USA, 1979.
3. Israelachvili, J.N. Intermolecular and Surface Forces, 3rd ed.; Academic Press: Burlington, VT, USA, 2011.
4. Hansen, C.M. Hansen Solubility Parameters: A User’s Handbook, 2nd ed.; Taylor & Francis: Boca Raton, FL, USA, 2007.
5. Gaudin, T.; Benazzouz, A.; Aubry, J.-M. Robust definition and prediction of dispersive Hansen solubility parameter δD with COSMO-RS. Comput. Theor. Chem. **2023**, 1221, 114023.
6. Theisen, A.; Johann, C.; Deacon, M.P.; Harding, S.E. Refractive Increment Data-Book for Polymer and Biomolecular Scientists; Nottingham University Press: Nottingham, UK, 2000.
7. Hoshino, D.; Nagahama, K.; Hirata, M. Prediction of refractive index of aliphatic hydrocarbons by the group contribution method. Sekiyu Gakkaishi **1979**, 22, 5.
8. Hoshino, D.; Nagahama, K.; Hirata, M. Prediction of the latent heat of vaporization at normal boiling point by use of refractive index. Sekiyu Gakkaishi **1981**, 24, 5.
9. Gakh, A.A.; Gakh, E.G.; Sumpter, B.G.; Noid, D.W. Neural Network-Graph Theory Approach to the Prediction of the Physical Properties of Organic Compounds. J. Chem. Inf. Comput. Sci. **1994**, 34, 832–839.
10. Wiener, H. Structural Determination of Paraffin Boiling Points. J. Am. Chem. Soc. **1947**, 69, 17–20.
11. Katritzky, A.R.; Sild, S.; Karelson, M. General Quantitative Structure−Property Relationship Treatment of the Refractive Index of Organic Compounds. J. Chem. Inf. Comput. Sci. **1998**, 38, 840–844.
12. Cocchi, M.; De Benedetti, P.G.; Seeber, R.; Tassi, L.; Ulrici, A. Development of Quantitative Structure−Property Relationships Using Calculated Descriptors for the Prediction of the Physicochemical Properties (nD, ρ, bp, ε, η) of a Series of Organic Solvents. J. Chem. Inf. Comput. Sci. **1999**, 39, 1190–1203.
13. Fioressi, S.E.; Bacelo, D.E.; Cui, W.P.; Saavedra, L.M.; Duchowicz, P.R. QSPR study on refractive indices of solvents commonly used in polymer chemistry using flexible molecular descriptors. SAR QSAR Environ. Res. **2015**, 26, 499–506.
14. Ha, Z.; Ring, Z.; Liu, S. Quantitative Structure−Property Relationship (QSPR) Models for Boiling Points, Specific Gravities, and Refraction Indices of Hydrocarbons. Energy Fuels **2005**, 19, 152–163.
15. Katritzky, A.R.; Sild, S.; Karelson, M. Correlation and Prediction of the Refractive Indices of Polymers by QSPR. J. Chem. Inf. Comput. Sci. **1998**, 38, 1171–1176.
16. Krishnaraj, S.; Neelamegam, P. Prediction of refractive index of organic compounds using structure-property studies. Res. J. Pharm. Biol. Chem. Sci. **2012**, 3, 597–611.
17. Redmond, H.; Thompson, J.E. Evaluation of a quantitative structure–property relationship (QSPR) for predicting mid-visible refractive index of secondary organic aerosol (SOA). Phys. Chem. Chem. Phys. **2011**, 13, 6872.
18. Gharagheizi, F.; Ilani-Kashkouli, P.; Kamari, A.; Mohammadi, A.H.; Ramjugernath, D. Group Contribution Model for the Prediction of Refractive Indices of Organic Compounds. J. Chem. Eng. Data **2014**, 59, 1930–1943.
19. Cai, C.; Marsh, A.; Zhang, Y.-h.; Reid, J.P. Group Contribution Approach To Predict the Refractive Index of Pure Organic Components in Ambient Organic Aerosol. Environ. Sci. Technol. **2017**, 51, 9683–9690.
20. Bouteloup, R.; Mathieu, D. Improved model for the refractive index: Application to potential components of ambient aerosol. Phys. Chem. Chem. Phys. **2018**, 20, 22017–22026.
21. Kragh, H. The Lorenz-Lorentz Formula: Origin and Early History. Substantia **2018**, 2, 7–18.
22. Mathieu, D.; Alaime, T. Insight into the contribution of individual functional groups to the flash point of organic compounds. J. Hazard. Mater. **2014**, 267, 169–174.
23. Mathieu, D.; Bouteloup, R. Reliable and Versatile Model for the Density of Liquids Based on Additive Volume Increments. Ind. Eng. Chem. Res. **2016**, 55, 12970–12980.
24. BIOVIA COSMOtherm, Release 2023; Dassault Systèmes: Vélizy-Villacoublay, France, 2022.
25. Klamt, A. Conductor-like Screening Model for Real Solvents: A New Approach to the Quantitative Calculation of Solvation Phenomena. J. Phys. Chem. **1995**, 99, 2224–2235.
26. Klamt, A.; Jonas, V.; Bürger, T.; Lohrenz, J.C.W. Refinement and Parametrization of COSMO-RS. J. Phys. Chem. A **1998**, 102, 5074–5085.
27. Eckert, F.; Klamt, A. Fast solvent screening via quantum chemistry: COSMO-RS approach. AIChE J. **2002**, 48, 369–385.
28. Goulon, A.; Picot, T.; Duprat, A.; Dreyfus, G. Predicting activities without computing descriptors: Graph machines for QSAR. SAR QSAR Environ. Res. **2007**, 18, 141–153.
29. Goussard, V.; Duprat, F.; Gerbaud, V.; Ploix, J.-L.; Dreyfus, G.; Nardello-Rataj, V.; Aubry, J.-M. Predicting the Surface Tension of Liquids: Comparison of Four Modeling Approaches and Application to Cosmetic Oils. J. Chem. Inf. Model. **2017**, 57, 2986–2995.
30. Goussard, V.; Duprat, F.; Ploix, J.-L.; Dreyfus, G.; Nardello-Rataj, V.; Aubry, J.-M. A New Machine-Learning Tool for Fast Estimation of Liquid Viscosity. Application to Cosmetic Oils. J. Chem. Inf. Model. **2020**, 60, 2012–2023.
31. Delforce, L.; Duprat, F.; Ploix, J.-L.; Ontiveros, J.F.; Goussard, V.; Nardello-Rataj, V.; Aubry, J.-M. Fast Prediction of the Equivalent Alkane Carbon Number Using Graph Machines and Neural Networks. ACS Omega **2022**, 7, 38869–38881.
32. Reaxys; Elsevier. Available online: https://www.reaxys.com (accessed on 1 December 2022).
33. Park, J.D.; Brown, H.A.; Lacher, J.R. A Study of Some Fluorine-containing β-Diketones. J. Am. Chem. Soc. **1953**, 75, 4753–4756.
34. Wohlfarth, C.; Wohlfarth, B.; Landolt, H.; Börnstein, R. Optical Constants Refractive Indices of Organic Liquids; Lechner, M.D., Ed.; Springer: Berlin/Heidelberg, Germany, 1996; Volume III38/B, p. 2639.
35. SciFinder; Chemical Abstracts Service: Columbus, OH. Experimental Properties: Optical and Scattering. Available online: https://scifinder.cas.org (accessed on 1 September 2023).
36. Gattow, G.; Krebs, B. Über Trithiokohlensäure H2CS3. Angew. Chem. **1962**, 74, 29.
37. Budavari, S.; O’Neil, M.J.; Smith, A.; Heckelman, P.E. The Merck Index, 11th ed.; Budavari, S., Ed.; MERCK & Co., Inc.: Rahway, NJ, USA, 1989.
38. Lide, D.R.; Bruno, T.J. CRC Handbook of Chemistry and Physics, 97th ed.; Haynes, W.M., Ed.; CRC Press: Boca Raton, FL, USA, 2017.
39. Godt, H.C.; Wann, R.E. The Synthesis of Organic Trithiocarbonates. J. Org. Chem. **1961**, 26, 4047–4051.
40. Wohlfarth, C.; Landolt, H.; Börnstein, R. Optical Constants Refractive Indices of Organic Liquids (Supplement to III/38); Lechner, M.D., Ed.; Springer: Berlin/Heidelberg, Germany, 2008; Volume III/47.
41. Wohlfarth, C.; Wohlfarth, B.; Landolt, H.; Börnstein, R. Optical Constants Refractive Indices of Inorganic, Organometallic, and Organononmetallic Liquids, and Binary Liquid Mixtures; Lechner, M.D., Ed.; Springer: Berlin/Heidelberg, Germany; New York, NY, USA, 1996; Volume III38/A, p. 1064.
42. Pubchem; National Institutes of Health. Available online: https://pubchem.ncbi.nlm.nih.gov/ (accessed on 1 September 2023).
43. Vollhardt, K.P.C. Organic Chemistry; W. H. Freeman & Co: New York, NY, USA, 1987.
44. Bicerano, J. Prediction of Polymer Properties, 3rd ed.; Marcel Dekker: New York, NY, USA, 2002; p. 784.
45. Dioury, F.; Duprat, A.; Dreyfus, G.; Ferroud, C.; Cossy, J. QSPR Prediction of the Stability Constants of Gadolinium(III) Complexes for Magnetic Resonance Imaging. J. Chem. Inf. Model. **2014**, 54, 2718–2731.
46. Dreyfus, G. Neural Networks: Methodology and Applications; Springer: Berlin, Germany; New York, NY, USA, 2005; p. 497.
47. Monari, G.; Dreyfus, G. Local Overfitting Control via Leverages. Neural Comput. **2002**, 14, 1481–1506.
48. Godbout, G.; Sicotte, Y. La relation entre l’indice de réfraction et la densité dans les liquides purs. J. Chim. Phys. **1968**, 65, 1944–1948.

**Figure 1.** Structure of 1,3-diketones and keto-enol forms for 1,1,1-trifluoropentane-2,4-dione and 2-acetylcyclohexan-1-one, and SMILES codes used for RI computations with the GM24 model.

**Figure 2.** Structures (**a**–**c**) of the test set compounds that have the largest negative and positive deviations for their computed RI using the GM24 model. RI$_{exp.}$, RI$_{est.}$, and Dev. stand for experimental RI, estimated RI, and deviation.

**Figure 3.** Scatter plot of refractive index estimations for the 3516 molecules of the TCI training set (blue disks) and of refractive index predictions for the 3515 molecules of the TCI test set (red circles) computed by graph machines vs. measured refractive index values. The black line is the bisector of the plot.

**Figure 4.** Scatter plot of refractive index predictions computed by geometrical fragment method [20] (blue disks) and graph machines (red circles) vs. measured refractive index values for the 1366 molecules in the CRC test set. The black line is the bisector of the plot.

**Figure 5.** Example of dataset simplification for the three 2-ethyloxiranes shown with their structure, registry number, isomeric SMILES, and refractive index value. The stereogenic center is marked with an asterisk.

**Figure 7.** Coding of (**a**) (Z)-1,2-dichloroethene and (**b**) (2S,3S)-2,3-dimethyloxirane from their 2D-structure into their directed graph (①) and graph machine (②). To simplify the GM representations, some bias inputs are omitted, and the implemented node functions are MLPs with zero hidden neurons. The red wavy line indicates a cycle opening in step ① to obtain an acyclic graph. The asterisks on the nodes of graph machine (**b**) correspond to the carbon atoms between which a bond has been broken.

**Figure 8.** Scatter plot of refractive index predictions computed by graph machines vs. measured refractive index values for the 175 compounds of the MIX test. Red disks are for pairs of diastereomers, and blue disks are for other compounds. The dashed line (y = 0.998x + 0.004) is the regression line for the total set.

**Table 1.** Estimation of the refractive index from SMILES by graph machine-based models of increasing complexity.

Number of Hidden Neurons | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 | 22 | 24 | 26 |
---|---|---|---|---|---|---|---|---|---|---|---|
RMSTE ($10^{-3}$) $^{1}$ | 11 | 10 | 9 | 8 | 7 | 6 | 6 | 5 | 5 | 4 | 4 |
VLOO score ($10^{-3}$) $^{2}$ | 12 | 11 | 10 | 9 | 8 | 8 | 7 | 7 | 7 | 6 | 6 |
MIN ($10^{-3}$) $^{3}$ | −47 | −42 | −41 | −40 | −39 | −38 | −33 | −29 | −24 | −19 | −18 |
MAX ($10^{-3}$) $^{3}$ | 103 | 87 | 72 | 60 | 46 | 37 | 32 | 30 | 29 | 26 | 24 |

$^{1}$ Mean of the RMSTE values and $^{2}$ mean of the VLOO scores, averaged over the 10 trained models (out of 100) having the smallest VLOO scores, both computed for three different parameter initializations, for the 3516 molecules of the training set; the standard deviation of all means is smaller than $10^{-4}$; $^{3}$ MIN and MAX denote the means of the minimum and maximum deviations from experiment.

Dataset | $N_T$ $^{1}$ | RMSE $^{2}$ | R$^{2}$ | MIN $^{3}$ | MAX $^{3}$ | STE $^{4}$ |
---|---|---|---|---|---|---|
Training | 3516 | 0.003 | 0.998 | −0.019 | 0.026 | 0.002 |
Test | 3515 | 0.006 | 0.990 | −0.051 | 0.036 | 0.006 |

$^{1}$ Number of elements in datasets, $^{2}$ root mean square error averaged over the 10 trained models (out of 100) having the smallest VLOO scores for the 3516 molecules of the training set, $^{3}$ minimum and maximum deviations from experiment, and $^{4}$ root mean square error computed for compounds with a stereochemical label in their graph machine.

Compound Name | CRC RI$_{exp.}$ $^{1}$ | GM24 RI$_{pred.}$ $^{2}$ | Other Sources RI $^{3}$ | Revised RI | MP or BP (°C) $^{4}$ |
---|---|---|---|---|---|
Dimethyl fumarate | 1.406 (110) | 1.443 | 1.406@111 [34] | - | 101.7 (mp) |
1,1-Difluoroethane | 1.301 (−72) | 1.271 | 1.301@−72 [35] | - | −24 (bp) |
(Dichlorofluoromethyl)benzene | 1.518 (11) | 1.514 | 1.514@20 [34]; 1.513@20 [35] | 1.514 | liq. |
1,1,1-Trichloro-2,2,2-trifluoroethane | 1.361 (35) | 1.365 | 1.360@20 [34]; 1.360@20 [35] | 1.360 | liq. |
Glycerol 1-acetate | 1.416 (20) | 1.451 | 1.450@20 [34]; 1.450@20 [35] | 1.450 | liq. |
Cyclohexylideneacetonitrile | 1.438 (25) | 1.489 | 1.483@25 [32]; 1.483@25 [35] | 1.483 | liq. |

$^{1}$ Values in brackets correspond to the measurement temperatures in °C, $^{2}$ predicted RI using graph machine-based model at 20 °C, $^{3}$ @T means measured at T °C as found in cited sources, and $^{4}$ liq. indicates that the compound is a liquid at 20 °C.

Dataset | $N_T$ $^{1}$ | QSPR test RMSE ($10^{-3}$) | GC test RMSE ($10^{-3}$) | GF test RMSE ($10^{-3}$) | GM24 test RMSE ($10^{-3}$) $^{2}$ |
---|---|---|---|---|---|
HR-JT | 52 | *20* $^{4}$ | 65 $^{3,5}$ | 16 | 10 |
CCAI | 116 | 62 $^{3,6}$ | *14* $^{4}$ | 17 | 11 |
CRC | 1366 | - | - | 16 | 10 |

$^{1}$ Number of elements in datasets, $^{2}$ test root mean square error averaged over the 10 trained models (out of 100) having the smallest VLOO scores for the 3516 molecules of the training set, $^{3}$ test RMSE are computed only for molecules that are not present in the training sets used for model parameterization, that is, 40 and 108 molecules for the HR-JT and CCAI sets, respectively, $^{4}$ values in italics are RMSTE instead of test RMSE, $^{5}$ GC predictions are from the paper of Cai et al. [19], and $^{6}$ QSPR predictions are calculated with the equation given in the Redmond and Thompson paper [17].

**Table 5.** Fitted ${n}_{repeat}$, B and C coefficients from Equation (10) based on refractive indices for 7 homologous series.

Homologous Series | $n_{repeat}$ | B | C | Initial Member (With No Repeat Unit) | $n_{exp.}$ | $\sqrt{B/C}$ |
---|---|---|---|---|---|---|
n-Alkanes | 1.469 $^{1}$ ($n_{PE}$) | 4.786 | 3.139 | ethane | n/a (gas) | 1.235 |
1-Iodoalkanes | 1.469 $^{1}$ ($n_{PE}$) | 4.837 | 2.063 | iodomethane | 1.531 | 1.531 |
Primary alcohols | 1.469 $^{1}$ ($n_{PE}$) | 6.008 | 3.407 | methanol | 1.329 | 1.328 |
Diaminoalkanes | 1.469 $^{1}$ ($n_{PE}$) | 207.670 | 97.681 | hydrazine | 1.457 | 1.458 |
Diiodoalkanes | 1.469 $^{1}$ ($n_{PE}$) | 7.758 | 2.281 | I$_{2}$ | n/a (solid) | 1.844 |
n-Alkanes | 1.475 $^{2}$ | 4.556 | 3.011 | ethane | n/a (gas) | 1.230 |
Perfluoroalkanes | 1.441 | 19.667 | 13.829 | perfluoroethane | n/a (gas) | 1.193 |
Methylated siloxanes | 1.398 | 2.021 | 1.067 | hexamethyldisiloxane | 1.377 | 1.376 |

$^{1}$ ${n}_{repeat}$ coefficient (corresponding to the refractive index of polyethylene $n_{PE}$) fitted for all five homologous series and $^{2}$ ${n}_{repeat}$ fitted from experimental data instead of assumed equal to that of polyethylene.

Dataset | $N_T$ $^{1}$ | RMSE $^{2}$ | R$^{2}$ | MIN $^{3}$ | MAX $^{3}$ | STE $^{4}$ |
---|---|---|---|---|---|---|
Training | 8267 | 0.004 | 0.995 | −0.024 | 0.028 | 0.003 |
MIX Test | 175 | 0.007 | 0.988 | −0.022 | 0.022 | 0.005 |

$^{1}$ Number of elements in datasets, $^{2}$ root mean square error averaged over the 10 trained models (out of 100) having the smallest VLOO scores for the 8267 molecules of the training set, $^{3}$ minimum and maximum deviations from experiment, and $^{4}$ root mean square error computed for the 22 compounds that have a stereochemical label in their graph machine.


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Duprat, F.; Ploix, J.-L.; Aubry, J.-M.; Gaudin, T.
Fast and Accurate Prediction of Refractive Index of Organic Liquids with Graph Machines. *Molecules* **2023**, *28*, 6805.
https://doi.org/10.3390/molecules28196805
