Data-Driven and Multiscale Modeling of DNA-Templated Dye Aggregates

Dye aggregates are of interest for excitonic applications, including biomedical imaging, organic photovoltaics, and quantum information systems. Dyes with large transition dipole moments (μ) are necessary to optimize coupling within dye aggregates. Extinction coefficients (ε) can be used to determine the μ of dyes, and so dyes with a large ε (>150,000 M−1cm−1) should be engineered or identified. However, dye properties leading to a large ε are not fully understood, and low-throughput methods of dye screening, such as experimental measurements or density functional theory (DFT) calculations, can be time-consuming. In order to screen large datasets of molecules for desirable properties (i.e., large ε and μ), a computational workflow was established using machine learning (ML), DFT, time-dependent (TD-) DFT, and molecular dynamics (MD). ML models were developed through training and validation on a dataset of 8802 dyes using structural features. A Classifier was developed with an accuracy of 97% and a Regressor was constructed with an R2 of above 0.9, comparing between experiment and ML prediction. Using the Regressor, the ε values of over 18,000 dyes were predicted. The top 100 dyes were further screened using DFT and TD-DFT to identify 15 dyes with a μ relative to a reference dye, pentamethine indocyanine dye Cy5. Two benchmark MD simulations were performed on Cy5 and Cy5.5 dimers, and it was found that MD could accurately capture experimental results. The results of this study exhibit that our computational workflow for identifying dyes with a large μ for excitonic applications is effective and can be used as a tool to develop new dyes for excitonic applications.


Introduction
Organic molecules, which absorb and emit light, also known as dyes, are useful for many applications, such as biomedical imaging [1,2], organic photovoltaics [3,4], nonlinear optics [5], and quantum information systems [6][7][8][9]. Key parameters that determine the performance of the dyes in those applications include the extinction coefficient (ε) and transition dipole moment (µ), as well as aggregation ability. Thus, optimizing the key electronic (e.g., µ) and molecular (e.g., aggregate) features is crucial for the desired applications of dye molecules. The interaction of dyes with light can be quantified via properties. The same photophysical data used to create the chemical-based datasets for dyes have been of interest for organic photovoltaics, and there are several public datasets available [52][53][54][55]. ML techniques applied to chemical space exploration is a rich field with a variety of methods from which to choose. The methods can be hierarchical with the size of the dataset, spanning from well-established supervised learning to more complex artificial neural networks [56].
In this work, a systematic approach, combining ML, DFT, TD-DFT, and MD methods, was used to screen dye monomers from an expansive dataset and provide insight into dye aggregate-DNA duplex interactions. We first used ML to identify ideal dye candidates with high extinction coefficients (ε) from a dataset of around 18,000 molecules. Then, for the 100 ML-selected dye candidates with desirable structural features and high values of ε, DFT and TD-DFT calculations were performed to predict their ground and excited state properties. Finally, benchmark MD simulations were conducted to reveal the interactions between the selected dye dimers and the DNA duplexes.

Machine Learning
Classifier and Regressor models were trained to identify ideal dye candidates with high extinction coefficients (ε) based on dye structure features. The Classifier model could quickly classify the dyes with either high or low ε, where we set a threshold of 150,000 M −1 cm −1 for strong exciton coupling in dye aggregates. As it learned from the Classifier model, the Regressor model could further estimate the values of ε for the dyes. Three data sources, including Deep4Chem [57], PhotoChem CAD 3 [54], and Dyomics GmbH [58] (8802 molecule datapoints in total), were used to train and validate the models. We utilized SMILES format for the molecule. We also utilized RDKit [59] to calculate 284 different features, such as the maximum carbon chain length and aromatic, amide, and ester group counts. The development dataset, containing 90% of data points, was used to train and test various model hyperparameters. Based on the accuracy for the Classifier and R 2 for the Regressor, a model was selected for further analysis. We then used the validation dataset, containing 10% of data points, to validate the selected model's effectiveness. The molecules with ε of above 800,000 M −1 cm −1 were excluded because the ε values deviated too greatly from the threshold of 150,000 M −1 cm −1 . Figure 1 shows the dataset breakdown with high and low ε values. The three data sources all have a data imbalance, where the number of molecules with low ε values is larger than that of molecules with high ε values. The effect of imbalanced data is discussed in Section 3.1.

Density Functional Theory
Density functional theory (DFT) and time-dependent (TD-) DFT calculations were performed to optimize dye structures in the ground state and calculate solvation energies

Density Functional Theory
Density functional theory (DFT) and time-dependent (TD-) DFT calculations were performed to optimize dye structures in the ground state and calculate solvation energies and transition dipole moments. Similarly to our previous work [46], the dyes were optimized with the M06-2X [60] functional and 6-31+G(d,p) basis set, to a residual force of 4.5 × 10 −4 Hartree/Bohr. The M06-2X functional with 6-31+G(d,p) basis set was validated in our prior studies of pristine and substituted cyanine and squaraine dyes [45,46] and has been used successfully for the calculations of the excited state properties of similar systems [61][62][63]. Frequency calculations were conducted to confirm the ground state structures were true minima. To determine µ, single point, vertical excited state calculations using the M06-2X functional were performed on the ground state structures to obtain transitions to the first 30 excited singlet states and identify the state with the largest oscillator strength. Calculations of the ground and excited state properties were conducted with implicit water solvation using the integral equation formalism polarizable continuum model (IEFPCM) [64,65], which was successfully used for the excited state property calculations of similar systems [66][67][68]. Excited state calculations were conducted assuming nonequilibrium solvent conditions.
To approximate the relative hydrophobicity of the dyes, the partitioning coefficient between n-octanol and water, log(P o/w ), was calculated according to [69,70] where ∆G o and ∆G w are the Gibbs free energy of solvation for a dye in n-octanol and water, respectively; R = 8.31 J mol·K ; and T = 273.15 K. A more positive value of log(P o/w ) means a molecule is more hydrophobic, and a more negative value means a molecule is more hydrophilic. In general, the Gibbs free energy of solvation for a molecule (∆G solv ) is a measure of the amount of energy required to dissolve the dye in solvent, and was calculated according to [45,46,68,71] ∆G solv = E solvated − E vacuum (2) where E solvated is the total energy of the dye in implicit solvent and E vacuum is the total energy of the dye in vacuum. Calculations for the solvation energy were conducted using the universal solvation model based on density (SMD) variation of IEFPCM [72], which was useful for predicting the solvation energies of organic molecules [73] and calculating the relative hydrophobicity of modified squaraine dyes [70]. All DFT and TD-DFT calculations were conducted using the Gaussian16 software package [74].

Molecular Dynamics
Molecular dynamics (MD) simulations were performed with the GROMACS 2020.3 software package [75]. Dye-DNA structures were built using the UCSF ChimeraX software [76] with the dyes initialized on the outside of the DNA backbone. The OL15 forcefield [77] with non-bonded modifications [78] was used for DNA parameters and the generalized amber forcefield (GAFF) [79] was used for dye parameters. Atomic charges for the dyes were calculated using the HF/6-31G* theory level [80]. The 26-basepair dsDNA duplex sequence and dye locations from Huff et al. [29] was used. The dye-DNA structures were solvated in TIP3P water [81] in a truncated octahedron box with 1.2 nm between the dye-DNA structure and the box edge. Mg 2+ ions were used to neutralize the system. Cannon et al. and Huff et al. showed, experimentally, that by adding excess MgCl 2 to solutions containing DNA duplexes with two pentamethine indocyanine Cy5 dyes, DNA Holliday junctions could be formed [28,29]. Because of this, no excess MgCl 2 was used in the MD simulations apart from replacing a necessary number of water molecules with Mg 2+ to achieve neutral charge. Neighbor-searching was used with a cutoff of 1.2 nm. Van der Waals interactions were limited to 1.2 nm, and the particle mesh Ewald (PME) was used with a real-space coulomb cutoff of 1.2 nm. Bonds to hydrogen atoms were constrained using the LINCS algorithm [82]. A timestep of 2 fs was used.
The initial systems were energy-minimized with the steepest descent method for 1000 steps. Then, to achieve a well-relaxed starting structure, two subsequent 10 ns equilibration steps were performed with harmonic constraints, the first with 1000 kJ mol·nm 2 spring constants applied to non-hydrogen atoms, and the second with 100 kJ mol·nm 2 spring constants applied to non-hydrogen atoms, keeping the number of atoms, volume, and temperature constant. A final 10 ns equilibration was performed with no restraints. Following equilibration, 1 µs production simulations were carried out, keeping the number of atoms, pressure, and temperature constant. The velocity-rescale thermostat [83] was used to maintain a constant temperature of 300 K with a coupling time of 0.1 ps, with the DNA-dye and solvent being coupled separately. The Parrinello-Rahman barostat [84] was used to keep the pressure at 1 atm with a coupling time of 1.0 ps. Coordinates were written every 10 ps and the first 100 ns of the production simulations were treated as equilibration periods, and so were not used for analysis.
To determine the transition dipole coupling strength between two dyes, the orientation of the dyes with respect to one another and the transition dipole moments µ are needed. The dye-dye center-to-center distances (R m,n ) and orientation factors (κ) were determined every 10 ps. The values of κ were determined using [48] κ =μ m ·μ n − 3 R m,n ·μ m R m,n ·μ n whereμ i is the transition dipole moment unit vector of dye m or n (taken along the long axis of the dye), andR m,n is the unit vector between the centers of dyes m and n. When |κ| = 0 or |κ| = 1.5, the dyes are in a stacked oblique orientation or tail-to-tail oblique orientation, respectively. When |κ| = 1 or |κ| = 2, the dyes are in a stacked (H-aggregate) or head-to-tail (J-aggregate) orientation, respectively. The exciton exchange energy (J m,n ), which is a measure of the strength of the transition dipole coupling between two dyes, depends on the transition dipole moment µ, which is related to ε [10][11][12] and can be obtained experimentally or using TD-DFT. The J m,n of a dimer was approximated using the extended dipole model according to [85] where the values r i and s i correspond to either end of dye m or n along the dye's long axis. The pre-factor term, J 0 , is defined as [85] J 0 = µ m µ n 4π o n 2 l m l n (5) where µ i is the transition dipole moment magnitude (calculated using TD-DFT) of dyes m or n, 0 is the vacuum permittivity constant, n is the refractive index of water (1.33), and l m and l n are the lengths of dyes m and n, respectively (such that l i = |r i − s i |).

Dye Screening Using Machine Learning and Density Functional Theory
A Random Forest Classifier was trained on the development dataset utilizing a five k-fold to determine the best ε threshold; the max feature, which refers to the maximum number of features to consider; the max depth, which refers to the maximum depth of the tree; and the class weight for the model. Every model we created has a high accuracy of 97% or above. Figure 2a shows the accuracy of the Random Forest Classifier with various ε thresholds in comparison with a model (labeled Always Low ε) that always classifies a molecule with low ε. The Always Low ε model has a high accuracy of around 92% or above, indicating that the high accuracy of 97% or above for the Random Forest Classifier is reasonable. In addition, the accuracy of the Random Forest Classifier starts to converge at 150,000 M −1 cm −1 , which supports our selection of the threshold of 150,000 M −1 cm −1 . To address a concern of the effect of imbalanced data on the accuracy of the models, we also trained and validated the models with different datasets that included various percentages of molecules with low ε values. Figure 2b shows the accuracy vs. percentage of low ε in the dataset. The value of 0.5 represents the balanced dataset containing 50% molecules with high ε values and 50% molecules with low ε values. The number of molecules with low ε values gradually increases, so the dataset becomes imbalanced. However, the accuracy of the Random Forest Classifier remains at 97% or above. These results indicate that the effect of the imbalanced data we used to train and validate the models is negligible.
k-fold to determine the best ε threshold; the max feature, which refers to the maximum number of features to consider; the max depth, which refers to the maximum depth of the tree; and the class weight for the model. Every model we created has a high accuracy of 97% or above. Figure 2a shows the accuracy of the Random Forest Classifier with various ε thresholds in comparison with a model (labeled Always Low ε) that always classifies a molecule with low ε. The Always Low ε model has a high accuracy of around 92% or above, indicating that the high accuracy of 97% or above for the Random Forest Classifier is reasonable. In addition, the accuracy of the Random Forest Classifier starts to converge at 150,000 M −1 cm −1 , which supports our selection of the threshold of 150,000 M −1 cm −1 . To address a concern of the effect of imbalanced data on the accuracy of the models, we also trained and validated the models with different datasets that included various percentages of molecules with low ε values. Figure 2b shows the accuracy vs. percentage of low ε in the dataset. The value of 0.5 represents the balanced dataset containing 50% molecules with high ε values and 50% molecules with low ε values. The number of molecules with low ε values gradually increases, so the dataset becomes imbalanced. However, the accuracy of the Random Forest Classifier remains at 97% or above. These results indicate that the effect of the imbalanced data we used to train and validate the models is negligible.  Based on the performance of the Random Classifier model, we developed a Random Forest Regressor model to further predict the precise ε value of a molecule. We trained the Regressor to determine its best hyperparameters of class weight, max features, max depth, and criterion as we did with the Random Forest Classifier. Figure 3 compares the predicted and actual ε values for the development and validation datasets, which have 0.95 and 0.91 for R 2 (i.e., the coefficient of determination), respectively. The datasets and codes developed in this study will be published online.
To identify dyes with high ε (which is indicative of high µ) the optimized Regressor model was further applied to a dataset containing around 18,000 potential dye candidates, including the commercially available cyanine dyes such as Cy3, Cy5, Cy5.5, and Cy7, shown in Figure 4. These dyes are known to exhibit large ε (and thus, large µ and strong dye coupling) [86], and are of interest to our research group and collaborators [27]. Four modified Cy5 dyes with hydrophobic substituents were also considered in this study. These dyes, labeled as Cy5-Cl, Cy5-Peg, Cy5-hex, and Cy5-tBu, are also shown in Figure 4 and are hypothesized to exhibit stronger dye coupling due to being more hydrophobic compared to Cy5, which may result in shorter inter-dye distances [87]. Two other modified dyes, Cy5-CN and Cy5-NMe 2 , were considered since CN and NMe 2 were shown to have a large effect on excess dipole moments (the difference in the dipole moments of the ground and excited states) of similar dyes [45,46]. The rest of the dataset consisted of dyes obtained from PubChem [88], including dyes with similar structures to cyanine, porphyrin, and methyl violet molecules. Those three classes of dyes were chosen for their prominent π-conjugation [89][90][91], absorption in the visible light range [90,92,93], and excitonic applications [90,92,[94][95][96].
Based on the performance of the Random Classifier model, we developed a Random Forest Regressor model to further predict the precise ε value of a molecule. We trained the Regressor to determine its best hyperparameters of class weight, max features, max depth, and criterion as we did with the Random Forest Classifier. Figure 3 compares the predicted and actual ε values for the development and validation datasets, which have 0.95 and 0.91 for R 2 (i.e., the coefficient of determination), respectively. The datasets and codes developed in this study will be published online. To identify dyes with high ε (which is indicative of high μ) the optimized Regressor model was further applied to a dataset containing around 18,000 potential dye candidates, including the commercially available cyanine dyes such as Cy3, Cy5, Cy5.5, and Cy7, shown in Figure 4. These dyes are known to exhibit large ε (and thus, large μ and strong dye coupling) [86], and are of interest to our research group and collaborators [27]. Four modified Cy5 dyes with hydrophobic substituents were also considered in this study. These dyes, labeled as Cy5-Cl, Cy5-Peg, Cy5-hex, and Cy5-tBu, are also shown in Figure  4 and are hypothesized to exhibit stronger dye coupling due to being more hydrophobic compared to Cy5, which may result in shorter inter-dye distances [87]. Two other modified dyes, Cy5-CN and Cy5-NMe2, were considered since CN and NMe2 were shown to have a large effect on excess dipole moments (the difference in the dipole moments of the ground and excited states) of similar dyes [45,46]. The rest of the dataset consisted of dyes obtained from PubChem [88], including dyes with similar structures to cyanine, porphyrin, and methyl violet molecules. Those three classes of dyes were chosen for their prominent π-conjugation [89][90][91], absorption in the visible light range [90,92,93], and excitonic applications [90,92,[94][95][96].    While µ can be extracted from experimental absorption spectra [10][11][12], it is not possible to determine µ based on the peak ε alone. Because of this, TD-DFT was used to calculate µ, to compare with experimentally available ε. Figure 5 shows a comparison of ML-predicted ε and TD-DFT-calculated µ with commercially available Cy3, Cy5, Cy5.5, and Cy7 dyes, as well as modified Cy5 dyes for which experimentally measured values of ε are available. The ε values for the commercially available dyes were obtained from their respective commercial websites, including AAT Bioquest [97], Lumiprobe [98], Glen Research [99], and Interchim [100], and from Huff et al. [29]. Because multiple vendors advertise slightly different ε values, a range is given in Figure 5. Meares et al. synthesized Cy5-hex, Cy5-Peg, Cy5-tBu, and Cy5-Cl for incorporation into DNA and measured the ε of the dyes incorporated into DNA strands at their peak wavelengths [87]. The ranges of ε for Cy5, Cy5-hex, Cy5-Peg, Cy5-tBu, and Cy5-Cl are likely due to small differences in local environments when the dyes are attached to DNA sequences and relative purities [87]. In general, ML-predicted ε values agree with the trend of experimental ε obtained from literature. Notably, ML-predicted ε values for Cy5, Cy5.5, and Cy7 are within the experimental ε range indicated by the shaded region in Figure 5. Similarly, the ML-predicted ε trend agrees with the trend of TD-DFT-calculated µ values, which are assumed to be correlated (i.e., larger ε leads to larger µ). The dyes that do not fall into the range of experimental ε consist of the Cy5 derivatives developed by Meares et al. [87]. The percent errors from the experiments include 45% for Cy5-hex, 49% for Cy5-Peg, 10% for Cy5-tBu, and 20% for Cy5-Cl. Such differences could be caused by solvent, DNA-dye interactions, and dye purities [87]. Furthermore, the specific functional groups for Cy5-hex and Cy5-Peg might not be well represented by the ML training dataset, which could lead to inaccuracies when predicting ε. In general, the Regressor is able to predict the overall trend of ε, which is necessary for the screening of numerous new dyes for ε. While μ can be extracted from experimental absorption spectra [10][11][12], it is not possible to determine μ based on the peak ε alone. Because of this, TD-DFT was used to calculate μ, to compare with experimentally available ε. Figure 5 shows a comparison of ML-predicted ε and TD-DFT-calculated μ with commercially available Cy3, Cy5, Cy5.5, and Cy7 dyes, as well as modified Cy5 dyes for which experimentally measured values of ε are available. The ε values for the commercially available dyes were obtained from their respective commercial websites, including AAT Bioquest [97], Lumiprobe [98], Glen Research [99], and Interchim [100], and from Huff et al. [29]. Because multiple vendors advertise slightly different ε values, a range is given in Figure 5. Meares et al. synthesized Cy5-hex, Cy5-Peg, Cy5-tBu, and Cy5-Cl for incorporation into DNA and measured the ε of the dyes incorporated into DNA strands at their peak wavelengths [87]. The ranges of ε for Cy5, Cy5-hex, Cy5-Peg, Cy5-tBu, and Cy5-Cl are likely due to small differences in local environments when the dyes are attached to DNA sequences and relative purities [87]. In general, ML-predicted ε values agree with the trend of experimental ε obtained from literature. Notably, ML-predicted ε values for Cy5, Cy5.5, and Cy7 are within the experimental ε range indicated by the shaded region in Figure 5. Similarly, the ML-predicted ε trend agrees with the trend of TD-DFT-calculated μ values, which are assumed to be correlated (i.e., larger ε leads to larger μ). The dyes that do not fall into the range of experimental ε consist of the Cy5 derivatives developed by Meares et al. [87]. The percent errors from the experiments include 45% for Cy5-hex, 49% for Cy5-Peg, 10% for Cy5-tBu, and 20% for Cy5-Cl. Such differences could be caused by solvent, DNA-dye interactions, and dye purities [87]. Furthermore, the specific functional groups for Cy5-hex and Cy5-Peg might not be well represented by the ML training dataset, which could lead to inaccuracies when predicting ε. In general, the Regressor is able to predict the overall trend of ε, which is necessary for the screening of numerous new dyes for ε. The Regressor model was also applied to the dataset obtained from PubChem [88] to identify additional potential dye candidates. The top 100 dyes that were predicted to have The Regressor model was also applied to the dataset obtained from PubChem [88] to identify additional potential dye candidates. The top 100 dyes that were predicted to have ε above 150,000 M −1 cm −1 were then screened using DFT and TD-DFT to determine their µ values by calculating vertical excited state transitions to the lowest 30 excited states. Of the 100 dye candidates, the 15 dye candidates with desirable properties, such as absorption wavelength in the visible region, large π-conjugated networks, and µ comparable to that of Cy5 (within 50%), are shown in Figure 5 and are labeled 1-15. Their corresponding ML-predicted ε and TD-DFT-calculated µ values are listed in Table 1. Table 1. ML-Regressor predicted extinction coefficient (ε) and TD-DFT-calculated transition dipole moments (µ) of dyes shown in Figure 4. Cartesian µ vector components along x, y, and z axis are provided in Table S1. Comparing the TD-DFT calculated values of µ for Cy3, Cy5.5, Cy7, and the Cy5 derivatives, Cy3 has the lowest µ. Disregarding Cy5-hex and Cy5-Peg, the values of µ and the ML-predicted ε generally follow the same trend, with Cy3 having the smallest µ and Cy7 having the largest. Comparing the 15 selected dyes from the Regressor model, all dyes are predicted to have an ε above 210,000 M −1 cm −1 . Dye 1 has the largest overall ML-predicted ε of 309,000 M −1 cm −1 , with a TD-DFT µ of 9.08 D. Dye 3 has the largest overall TD-DFT µ of 20.25 D and an ML-predicted ε of 265,000 M −1 cm −1 . Figure 6 shows the log(P o/w ) values calculated with vacuum and implicit solvent DFT for the ML-selected dyes. A more positive value of log(P o/w ) indicates a more hydrophobic dye, and a more negative value of log(P o/w ) indicates a more hydrophilic dye. It is hypothesized that by increasing hydrophobicity, dyes may aggregate closer, thus improving coupling. This has been demonstrated in a set of squaraine dyes modified with hydrophobic substituents [70], and the values of log(P o/w ) for the hydrophobic squaraine dyes are similar to those for the Cy5 derivatives. Comparing the Cy5 derivatives, Cy3, Cy5.5, and Cy7, Cy5-hex is the most hydrophobic and Cy3 is the least hydrophobic, closely followed by Cy5-CN. Most of the 15 dyes chosen from the ML Regressor model predictions exhibit hydrophobicity similar to that of Cy5. Three dyes are hydrophilic (dyes 2, 4, and 13), with dye 13 being the most hydrophilic. These three dyes also exhibit relatively low µ values compared to the rest of the dataset, indicating that they may not be suitable for excitonic applications that require close inter-dye separations and large transition dipole moment couplings. Conversely, dyes 7, 12, and 15 exhibit the most positive log(P o/w ) values, meaning they are estimated to be the most hydrophobic. Furthermore, dyes 7, 12, and 15 have µ values within 25% of that for Cy5, making those dyes suitable for excitonic applications. Dyes 3 and 5 have log(P o/w ) values slightly larger than that of Cy5, and µ values about 5 D larger than Cy5. Overall, based on our criteria-a large ε (indicating large µ) and a large positive log(P o/w )-dyes 3 and 5 are the most promising candidates in the dataset for excitonic applications. esized that by increasing hydrophobicity, dyes may aggregate closer, thus improving coupling. This has been demonstrated in a set of squaraine dyes modified with hydrophobic substituents [70], and the values of log(Po/w) for the hydrophobic squaraine dyes are similar to those for the Cy5 derivatives. Comparing the Cy5 derivatives, Cy3, Cy5.5, and Cy7, Cy5-hex is the most hydrophobic and Cy3 is the least hydrophobic, closely followed by Cy5-CN. Most of the 15 dyes chosen from the ML Regressor model predictions exhibit hydrophobicity similar to that of Cy5. Three dyes are hydrophilic (dyes 2, 4, and 13), with dye 13 being the most hydrophilic. These three dyes also exhibit relatively low μ values compared to the rest of the dataset, indicating that they may not be suitable for excitonic applications that require close inter-dye separations and large transition dipole moment couplings. Conversely, dyes 7, 12, and 15 exhibit the most positive log(Po/w) values, meaning they are estimated to be the most hydrophobic. Furthermore, dyes 7, 12, and 15 have μ values within 25% of that for Cy5, making those dyes suitable for excitonic applications. Dyes 3 and 5 have log(Po/w) values slightly larger than that of Cy5, and μ values about 5 D larger than Cy5. Overall, based on our criteria-a large ε (indicating large μ) and a large positive log(Po/w)-dyes 3 and 5 are the most promising candidates in the dataset for excitonic applications.  Table S2. A more positive log(Po/w) means a molecule is more hydrophobic, and a more negative log(Po/w) means a molecule is more hydrophilic. The labels 1-15 and dye names correspond to the dye structures in Table 1.

Molecular Dynamics Simulations of Dye Aggregate-DNA Duplex Interactions
To study the effects of DNA on the dye orientations, 1 μs MD simulations were performed with two dyes covalently bound to the backbone of DNA duplexes via dual phosphoramidite linkers. In our study, we started with commercially available Cy5 and Cy5.5 as reference dyes to guide other dye candidate selection. Our research group experimentally demonstrated that Cy5 can exhibit aggregation, strong absorption, and excitonic coupling when attached to DNA duplexes and Holliday Junctions [27][28][29]. Due to its similar structure to Cy5, Cy5.5 should exhibit similar properties [38]. The extra aryl groups on Cy5.5 extend the conjugation and add to the size of the molecule, which could affect dyepacking and make μ . slightly larger than μ , as shown in Table 1. Furthermore, as shown in Figure 6, log(Po/w) for Cy5.5 is 40% larger than that for Cy5, indicating that Cy5.5 might pack closer than Cy5 when aggregated. Thus, we chose a Cy5 dimer and a Cy5.5 Figure 6. Partition coefficients of dyes in water versus n-octanol (log(P o/w )), calculated using Equation (1), where Gibbs free energies of solvation of dyes in implicit water and n-octanol solvents are provided in Table S2. A more positive log(P o/w ) means a molecule is more hydrophobic, and a more negative log(P o/w ) means a molecule is more hydrophilic. The labels 1-15 and dye names correspond to the dye structures in Table 1.

Molecular Dynamics Simulations of Dye Aggregate-DNA Duplex Interactions
To study the effects of DNA on the dye orientations, 1 µs MD simulations were performed with two dyes covalently bound to the backbone of DNA duplexes via dual phosphoramidite linkers. In our study, we started with commercially available Cy5 and Cy5.5 as reference dyes to guide other dye candidate selection. Our research group experimentally demonstrated that Cy5 can exhibit aggregation, strong absorption, and excitonic coupling when attached to DNA duplexes and Holliday Junctions [27][28][29]. Due to its similar structure to Cy5, Cy5.5 should exhibit similar properties [38]. The extra aryl groups on Cy5.5 extend the conjugation and add to the size of the molecule, which could affect dye-packing and make µ Cy5.5 slightly larger than µ Cy5 , as shown in Table 1. Furthermore, as shown in Figure 6, log(P o/w ) for Cy5.5 is 40% larger than that for Cy5, indicating that Cy5.5 might pack closer than Cy5 when aggregated. Thus, we chose a Cy5 dimer and a Cy5.5 dimer for MD simulations as a benchmark for comparison with future simulations of other selected dyes.
The simulations were performed in water at a 1 atm pressure and 300 K. Dye orientations were quantified using the orientation factor, κ, calculated using Equation (3). Dimer exciton exchange energies, |J|, were quantified using Equation (4) Table 1. The vectors corresponding to µ were found to primarily reside on the long axis of the dyes, and thus, the values of r i and s i in Equation (4) were chosen as the centers of the terminal aryl groups of the dyes.
The MD results for the Cy5 dimer attached to a DNA duplex are presented in Figures 7 and 8. Figure 7a shows a heatmap plot of |κ| versus R, and Figure 7b shows a heatmap plot of |J| versus R for the 900 ns of data collection. Based on the |κ| values shown in Figure 7a, there are two distinct dimer orientations, labeled "O1" and "O2". The approximate |J| regions corresponding to O1 and O2 are also shown in Figure 7b. As shown in Figure 7c, O1 corresponds to where the dyes are located outside of the duplex (i.e., nonintercalated). This orientation has a |κ| value ranging approximately from 0-1 (oblique and H-like aggregate) and an R of approximately 2.5-3.0 nm, which results in a relatively low |J| of less than 20 meV. Examining Figure 8 [29]. Furthermore, they determined the dimer to have a similar orientation, with a red-shift in the main absorption peak observed for the dimer relative to the monomer, indicating a mostly J-like orientation [29]. The ∼ 10 meV larger |J| value obtained from MD might be caused by a slight overestimation of µ Cy5 using TD-DFT, which is shown to be ∼ 2 D larger than the µ Cy5 obtained from experimental measurements [29,46].
Compared to the Cy5 dimer, the Cy5.5 dimer exhibits a similar trajectory. The Cy5.5 dyes were initialized, outside of the DNA duplex (i.e., non-intercalated). After about 400 ns, the dyes intercalated, reducing R and increasing |κ| (more J-like) and |J|. The two regions corresponding to the non-intercalated and intercalated states of the Cy5.5 dimer are labeled as "O1" and "O2" in Figure 9a,b, which are represented in the snapshots in Figure 9c,d. The O1 region of the Cy5.5 dimer has an R that ranges from about 2.2-3.2 nm. However, |κ| for the Cy5.5 dimer has a range of roughly 0-1.5, which is larger than that for the Cy5 dimer. However, |J| for the O1 region of the Cy5.5 dimer is between 0-15 meV, comparable to that of the Cy5 dimer. The ranges of R and |κ| of the O2 region of Cy5.5 are roughly 0.8-1.7 nm and 1.2-1.75, respectively, similar to that of the Cy5 dimer. This range of orientations results in a |J| range of about 30-60 meV, slightly smaller than that of the Cy5 dimer.
Averaging the dimer orientations past 400 ns (after which |κ| and R are relatively stable, as shown in Figure 10) results in |κ| = 1.33 ± 0.15 and R = 1.17 ± 0.17 nm. Similarly, averaging |J| over this time period results in |J| = 44.98 ± 11.6 meV. Even though the two dimers exhibit similar orientations, the smaller |J| for the Cy5.5 dimer might be caused by a small structure difference between Cy5.5 and Cy5. Cy5.5 is slightly larger and longer than Cy5 (by about 0.1 nm) due to the extra aryl groups at the two ends of the dye. Despite the smaller average R and similar orientations, the larger dye length of Cy5.5 compared to that of Cy5 results in a smaller pre-factor term J 0 , leading to the smaller |J|. In a similar study, it was found that Cy5.5 homodimers attached to transverse strands on a DNA Holliday junction exhibited closer dye distances and larger |J| values than Cy5 homodimers [38]. A potential reason for this difference might be that the DNA Holliday junctions are more flexible than the duplexes, which allow dye orientations to promote larger J values.   are labeled as "O1" and "O2" in Figure 9a,b, which are represented in the snapshots in Figure 9c,d. The O1 region of the Cy5.5 dimer has an R that ranges from about 2.2-3.2 nm. However, |κ| for the Cy5.5 dimer has a range of roughly 0-1.5, which is larger than that for the Cy5 dimer. However, |J| for the O1 region of the Cy5.5 dimer is between 0-15 meV, comparable to that of the Cy5 dimer. The ranges of R and |κ| of the O2 region of Cy5.5 are roughly 0.8-1.7 nm and 1.2-1.75, respectively, similar to that of the Cy5 dimer. This range of orientations results in a |J| range of about 30-60 meV, slightly smaller than that of the Cy5 dimer. Averaging the dimer orientations past 400 ns (after which |κ| and R are relatively stable, as shown in Figure 10) results in |κ| = 1.33 ± 0.15 and R = 1.17 ± 0.17 nm. Similarly, averaging |J| over this time period results in |J| = 44.98 ± 11.6 meV. Even though the two dimers exhibit similar orientations, the smaller |J| for the Cy5.5 dimer might be caused by a small structure difference between Cy5.5 and Cy5. Cy5.5 is slightly larger and longer than Cy5 (by about 0.1 nm) due to the extra aryl groups at the two ends of the dye. Despite the smaller average R and similar orientations, the larger dye length of Cy5.5 compared to that of Cy5 results in a smaller pre-factor term J , leading to the smaller |J|. In a similar study, it was found that Cy5.5 homodimers attached to transverse strands on a DNA Holliday junction exhibited closer dye distances and larger |J| values than Cy5 homodimers [38]. A potential reason for this difference might be that the DNA Holliday junctions are more flexible than the duplexes, which allow dye orientations to promote larger J values. Figure 10. Plots of (a) dye center-to-center distance (R), (b) orientation factor (|κ|), and (c) exciton exchange energy (|J|) versus time for the 900 ns Cy5.5 dimer-DNA duplex MD trajectory. The first 100 ns of the simulation were treated as an equilibration and are therefore excluded.

Discussion
Based on the results shown in Figure 5, the developed Regressor is useful for the prediction of experimental ε trends. This was shown by predicting the ε of a dataset of over 18,000 molecules, from which 15 were identified using a combination of Regressor predictions and TD-DFT calculations. In general, according to desired criteria the developed ML model can quickly and accurately screen a large dataset of dyes for optimum ε Figure 10. Plots of (a) dye center-to-center distance (R), (b) orientation factor (|κ|), and (c) exciton exchange energy (|J|) versus time for the 900 ns Cy5.5 dimer-DNA duplex MD trajectory. The first 100 ns of the simulation were treated as an equilibration and are therefore excluded.

Discussion
Based on the results shown in Figure 5, the developed Regressor is useful for the prediction of experimental ε trends. This was shown by predicting the ε of a dataset of over 18,000 molecules, from which 15 were identified using a combination of Regressor predictions and TD-DFT calculations. In general, according to desired criteria the developed ML model can quickly and accurately screen a large dataset of dyes for optimum ε for further study, which could save computation and experimental time.
There are improvements that can be made to the developed ML models. Primarily, more data would improve the chemical space on which the ML models would be trained, such as more chemical structures, as well as the inclusion of the solvents used. This is highlighted in Figure 1, which shows a disproportion in large versus small ε values in the training set. Furthermore, the present model could benefit from optimization and feature engineering, which could further reduce the computation time and help to improve the model by identifying more features that may better describe dye properties. Other types of ML models such as Support Vector Machine (SVM) algorithms could also be explored, which could potentially lead to better predictions. Finally, an improvement could be implemented by including human knowledge in the workflow so as to identify desirable structural features; this could be useful in interpreting data, improving efficiency, and enhancing model performance.
Examining Figure 5 and Table 1, the trends of the TD-DFT-calculated µ and the available experimentally determined ε mostly agree when considering the upper range of ε values. The deviations in ε compared to µ could be due to the solvent chosen for TD-DFT calculations (in our case, water) or the exchange-correlation functional used for the calculations. For example, our prior studies showed that the CAM-B3LYP functional produced slightly different µ values for similar dyes, however, the overall trend remained the same [46]. In the case of the Cy5 derivatives, the ε values might be affected by the specific DNA strand to which the dyes are attached, leading to the range in ε shown in Figure 5 for those dyes. This could also lead to some degree of disagreement with the TD-DFT calculations of µ (only considering free dyes in water) and the ML-predicted ε. Furthermore, in general, there is no strong correlation between the ML-predicted ε and TD-DFT µ for the 15 dyes shown in Table 1. This highlights the need for TD-DFT calculations as a validation for the initial dye screening using the ML Regressor, as the ML model is not 100% accurate. Studying the values of µ in Table 1, the main dye feature leading to a large µ is long conjugated chains, as is the case for dyes 3 and 5. The same trend is observed for Cy5 and Cy7, which have two and four more carbons in the conjugated chain compared to Cy3, respectively. Despite having similar structures, dye 10 has a~1 D larger µ compared to dye 8, which might be caused by the Cl atom bonded to the polymethine chain in dye 8.
Examining Figure 6, the Cy5 derivatives modified with hydrophobic groups exhibit an enhanced hydrophobicity (as exemplified by more positive log(P o/w ) values), and the trend of log(P o/w ) follows the same trend obtained by Meares et al. who used the Percepta Platform to predict log(P o/w ) based on chemical structure [87]. Since these dyes are derived from the Cy5 dye, which has been shown to form aggregates when attached to DNA scaffolds [27][28][29], they are promising candidates for further studies with an aim of enhancing |J|. Examining the log(P o/w ) for the 15 ML-selected dyes, most exhibit values similar to Cy5 and its derivatives, with the notable exceptions of dyes 2, 4, 7, 12, 13, and 15. Dyes 2, 4, and 13 have a negative log(P o/w ) and are, therefore, expected to be hydrophilic. Dye 2 is structurally similar to dyes 1, 7, and 12, which have positive log(P o/w ) values. The hydrophilicity of dye 2 might be attributed to the Se atoms in the ring, which dyes 1, 7, and 12 do not contain. Dyes 4 and 13 contain S groups, which are known to increase hydrophilicity [101]. Dyes 7, 12, and 15 stand out as the most hydrophobic of the dyes tested and all belong to the same family of porphyrins, which are known to be hydrophobic [102,103]. Based on their relatively high µ and enhanced hydrophobicity, dyes 7, 12, and 15 (porphyrin-based dyes) are promising candidates for dye-DNA applications. Furthermore, it has been shown that porphyrin dyes are suitable for bonding to DNA and are able to intercalate between bases, which may be beneficial for dye aggregation in DNA [102].
Our MD simulations show that the Cy5 dimer and Cy5.5 dimer orient similarly when attached to the same positions on a DNA duplex. Comparing the simulation results for the Cy5 dimer to Huff et al., our predicted |J| and R are within 20% and 5%, respectively [29]. It should be noted that the results shown by Huff et al. were obtained using a program developed by our group based on the Kuhn-Renger-May (KRM) theory, which only considers absorbance and circular dichroism spectra to obtain dye orientations [29]. However, the KRM-based program does not provide information on the details of the DNA-dye orientations or the impact of the DNA on the dye orientation as our MD simulations do. Based on the MD results, one possible mechanism is dye intercalation into the DNA duplex from outside of the DNA backbone to form a J-dimer, which occurs for both Cy5 and Cy5.5 simulations. Our results highlight the effectiveness of MD simulations for predicting dye orientations and the excitonic properties of dyes bound to DNA duplexes. For future studies, we will continue applying MD to additional ML-selected dye aggregates in DNA scaffolds.

Conclusions
ML models were developed through training and validation on a set of 8802 molecules to predict ε based on dye structures. A Random Forest Classifier was developed with an accuracy of 97%, and based on the performance of the Classifier, a Random Forest Regressor was constructed. Comparing ML-predicted ε from the Regressor and experimental ε, the Regressor was found to have a maximum R 2 of 0.95. Using the Regressor, the ε values of molecules in a dataset of around 18,000 were predicted, and the top 100 dyes were used in TD-DFT calculations to calculate µ. Overall, 15 dyes were identified to have a relatively large µ comparable to Cy5. MD simulations were conducted on reference dyes (Cy5 and Cy5.5) to determine the dye dimer orientations and transition dipole moment couplings, |J|. For Cy5, the MD simulations were able to predict dye orientations and |J| within 20% of the experiment. The Cy5.5 dimer yielded similar results to the Cy5 dimer. The successful use of the combined ML and DFT/TD-DFT screening to identify dyes with a large ε and µ highlights the effectiveness of our workflow to screen numerous dyes for desired properties. The agreement of our MD simulations with the experiment show that we can accurately detect dye dimer properties when attached to DNA scaffolds, which is crucial to guide dye design and synthesis for excitonic applications.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/molecules27113456/s1, Table S1: Transition dipole moment (µ) vector components of the dyes in Figure 4; Table S2: Gibbs free energies of solvation (∆G solv ) of the dyes in Figure 4 in water and n-octanol implicit solvents.

Informed Consent Statement: Not applicable.
Data Availability Statement: Select ML-predicted extinction coefficients, solvation energy, and transition dipole moment data presented in this study are available in the main text and in the Supplementary Materials. The data used to train and validate the machine learning models and the codes for the models will be published online soon.