An Expedited Route to Optical and Electronic Properties at Finite Temperature via Unsupervised Learning

Electronic properties and absorption spectra are the grounds to investigate molecular electronic states and their interactions with the environment. Modeling and computations are required for the molecular understanding and design strategies of photo-active materials and sensors. However, the interpretation of such properties demands expensive computations and dealing with the interplay of electronic excited states with the conformational freedom of the chromophores in complex matrices (i.e., solvents, biomolecules, crystals) at finite temperature. Computational protocols combining time dependent density functional theory and ab initio molecular dynamics (MD) have become very powerful in this field, although they require still a large number of computations for a detailed reproduction of electronic properties, such as band shapes. Besides the ongoing research in more traditional computational chemistry fields, data analysis and machine learning methods have been increasingly employed as complementary approaches for efficient data exploration, prediction and model development, starting from the data resulting from MD simulations and electronic structure calculations. In this work, dataset reduction capabilities by unsupervised clustering techniques applied to MD trajectories are proposed and tested for the ab initio modeling of electronic absorption spectra of two challenging case studies: a non-covalent charge-transfer dimer and a ruthenium complex in solution at room temperature. The K-medoids clustering technique is applied and is proven to be able to reduce by ∼100 times the total cost of excited state calculations on an MD sampling with no loss in the accuracy and it also provides an easier understanding of the representative structures (medoids) to be analyzed on the molecular scale.


Introduction
Photo-induced phenomena and optical properties are the grounds to investigate electronic states and their interactions with the environment . Experimental spectra can be interpreted via computational approaches at the molecular scale, understanding the microscopic characteristics that determine the position, width and shape of absorption bands [23][24][25][26][27][28][29][30][31][32][33][34]. However, a number of challenges remain open and mainly concern the modeling of either floppy molecules or non-covalent complexes in solution. Ideal approaches to deal with the complexity of the conformational freedom have to ensure an adequate sampling of the phase space of the potential energy surface (PES) at a given temperature, since such systems cannot be easily described by minimum energy structures as starting points for subsequent more computationally expensive calculations required to compute electronic transitions and excited state properties [34][35][36]. Molecular dynamics (MD) is the perfect technique for this goal since it can simultaneously describe the conformational freedom and the complexity of the environment (i.e., explicit solvent models) and can guarantee a satisfactory sampling of the phase space of the PES for selecting the initial states of the electronic transitions. On these bases, it is possible to reproduce the thermal fluctuations in a classical manner and simulate the shape of the electronic spectrum by classically considering the spreading of the vertical transitions of a representative sample of snapshots of the MD trajectory [37][38][39].
An accurate description of the electronic layout of a system is very important for the excited state properties and optical absorption. The electronic state separation, and the resulting UV-Vis absorption, strongly depend on the reference structure(s), indeed. A very detailed description of the PES ruling the system dynamics is demanded when standard force fields cannot by easily used, i.e., with non-covalent charge-transfer complexes [40][41][42][43][44][45][46][47][48], metal compounds or usually when the electronic density reorganization is involved during the time, even in the ground state. This is usually quite common also when an environment reorganization is involved as well. Since parameterized force fields cannot account for explicit electronic effects, an explicit treatment of electronic degrees of freedom is mandatory via ab initio methods. However, when reasonable large systems (≥1000 atoms) are studied, accurate wavefunction-based methods cannot be employed due to their high computational cost (above all, for excited state properties), although some progress has been recently achieved using graphical processing units [49] and localization procedures [50,51]. Thus, density functional theory (DFT) and time dependent (TD-) DFT, the latter required for excited state quantities, are usually chosen as a good compromise between accuracy and computational costs [10,[52][53][54][55][56][57][58][59][60][61][62].
Besides canonical computational chemistry fields, data analysis and machine learning (ML) methods have been increasingly employed as complementary approaches for an efficient data exploration, prediction and model development, starting from experimental data (structure, properties, reactivity) or from MD simulations and electronic structure calculations [63][64][65][66][67][68][69][70][71][72][73][74][75][76]. In particular, MD simulations often produce very big datasets (i.e., the collected trajectory in the phase-space), especially for long simulation times and extended systems, which can be difficult to manually inspect. Automated ML data analysis techniques thus can offer a valuable and efficient option to extract the significant and "physical" information from MD trajectories. In particular, unsupervised ML methods, such as clustering analysis, are able to partition a dataset according to similarities in some features space, employing only the input values and not requiring any output ones supplied by the user. Clustering proved a valuable tool for MD simulation analysis, allowing one to reduce the high number of sampled structures into a few representative ones, approximating conformational energy minima [77][78][79][80][81][82][83][84][85][86][87][88][89][90][91][92].
The simulation of electronic band-shapes at finite temperature through an MD sampling potentially requires hundreds of excited state calculations. Therefore, alternative routes such as the selection of a small number of representative frames could both reduce the computational cost of spectra calculations and simplify their interpretation [78,82]. In this work, dataset reduction capabilities via clustering techniques applied to MD trajectories in specifically tailored feature spaces were tested in the simulation of electronic absorption spectra of two model compounds. In particular, spectra computed only from the clusters' representative frames showed a remarkable reproduction of the main spectral features if compared to spectra from a uniform sampling of frames of the trajectory (a subset of ∼500 structures). This approach also allowed an easier interpretation of the calculated bands, which could result from many states close in energy but differing for their spatial properties.
Dataset reduction capabilities via unsupervised clustering techniques applied to MD trajectories are proposed and tested for the ab initio modeling of electronic absorption spectra of two challenging case studies. The first investigated model system is a prototypical π-stacked non-covalent dimer in dichloromethane (DCM) solvent (see Figure 1, left panel), consisting of an electron donor (1-chloronaphthalene, 1ClN) and an electron acceptor (tetracyanoethylene, TCNE). This represents a challenging case study with respect to evaluating the performance of ML clustering techniques in reproducing its electronic/optical properties, considering that the potential energy surface of weakly bound systems, ruled by dispersive intermolecular forces, is quite flat and numerous isoenergetic orientational isomers can be present in the solution. The TCNE:π:1ClN dimer has been thoroughly investigated in recent years by means of Femtosecond Stimulated Raman Spectroscopy [93] and through electronic structure methods for the detailed characterization of the ground state properties [35] and to unveil the nuclear relaxation upon photoexcitation downhill from the Franck-Condon region of the first charge-transfer state [34,94]. The second system is instead a Ru(II) complex, [Ru(dcbpy) 2 (NCS) 2 ] 4− (dcbpy = 4,4 -dicarboxy-2,2 -bipyridine) in water solution, also called "N3 4− " (Figure 1, right panel), which is a popular example of Ru-based dye sensitizers for solar cells and light-harvesting applications [95][96][97]. Light absorption by N3 4− in the visible region induced excitation to a dense manifold of metalto-ligand charge-transfer ( 1 MLCT) states. N3 4− photo-physical behavior is characterized by an ultrafast relaxation pathway among the singlet and triplet manifolds, induced by a complex interplay between closely spaced coupled electronic states, nuclear motion and solvent rearrangement [98][99][100][101][102][103][104][105], potentially influencing the dynamics and efficiency of the electron injection into a semiconductor substrate [106][107][108][109][110]. The N3 4− complex, both for its dense 1 MLCT manifold and its conformational dynamics in the solution [111], represents therefore another ideal model system for testing an efficient MD/ML clustering approach for the simulation of electronic spectra including finite-temperature effects.

The TCNE:π:1ClN Case Study
The massive amount of data acquired during an MD simulation requires analyses that are capable of going beyond the visual inspection of snapshots and average structures. In this regard, the statistical analysis of the trajectories through the calculation of the distribution functions represents an advantageous choice with respect to taking into account conformational dynamics, solute-solvent interactions and so on. In Figure 2, a threedimensional spatial distribution function (SDF) [112,113] is presented, computed along the 10 ps-long AIMD trajectory by considering the 1ClN as reference molecule (for which a local three points coordinate system was defined) and the center of the mass of the TCNE unit. From the SDF, it is observed that the TCNE remains on the same side of the ClN and slips on both rings during the exploration of the ground state PES, proving the presence of different mutual configurations and distances due to the weak Coulombic interactions that rule the π-stacked arrangement. According to the procedure presented in Section 3.2, the MD trajectory clustering thus yielded, for the TCNE:π:1ClN case study, five medoids (each one is representative of the corresponding cluster), characteristic of the accessible conformational space, in agreement with the SDF previously discussed. The clustering structural feature values shown by the five medoids, namely the rotation angle (θ r ) and the rotation axis (n r ) of the two subunits molecular planes and the position vector between the two geometric centers ( r N−E ), are collected in Table 1 (see also Section 3.2 for features definitions). A detailed analysis of medoid structural features reveals that they do not significantly differ in the rotation angle θ r between the two TCNE and 1ClN planes (values within a small range, ∼5 degrees), but mainly in the planes' relative orientation, as suggested by clearly differentn r axis components and secondarily in the TCNE-1ClN relative position. A closer inspection of the rotation axisn r components for medoids 2 and 4 reveals that they differ along the xand z-axes (see Table 1), while for medoids 3 and 5 only the component along the z-axis is reoriented for these latter. Additionally, analyzing the five medoids obtained from the clustering approach (side view presented in the right panel of Figure 3), considerable geometric deformations are present and the relative position of the molecular planes is also different. In order to acquire a visual representation of such clustering, the trajectory was projected onto the subspace of the features' first two principal components (PCs; please refer to Section 3.3 for technical details). A clear cluster separation was obtained, with each medoid representing a different portion of the conformational space (see Figure 4). The partial superposition of cluster 5 with cluster 3 in the principal components subspace is only an artifact (being instead separated in the full space) due to the reduced variance explained by the first two PCs (∼55.5% of the total variance). Table 1. Clustering feature values of the five cluster medoids from TCNE:π:1ClN trajectory. θ r : rotation angle (angle between versors normal to the two molecular planes, degrees),n r : rotation axis (versor normal to the former ones), r N−E : relative position vector (between 1ClN and TCNE geometric centers). Vector quantities are given as cartesian components (Å) in a fixed frame of reference.   . TCNE:π:1ClN trajectory in the features' first two principal components space. Cluster partition is represented through different colors. Cluster medoids are also highlighted (as star symbols). The color scheme adopted is kept fixed throughout this section.

Medoid
From Figure 3, the found medoids overall show pairwise structural similarities (see 2-4 and 3-5) if observed in a top-down direction; see Table 1 for a more quantitative evaluation. The conformational flexibility of the TCNE:1ClN π-stacked complex, which is due to the weak dispersion forces, is thus fully captured with the trajectory clustering approach.
The UV-Vis spectrum comprising the first low-lying singlet states computed for each medoid within linear response TD-DFT formalism is reported in Figure 5. The estimation of the whole electronic spectrum (red curve) was obtained according to Equation (5) and the procedure explained in Section 3.4. A comprehensive analysis of the electronic spectrum has been recently provided by some of the authors in Ref. [35]. On the other hand, we report a detailed summary of the characterization of the transitions towards the S 1 and S 2 excited states in Table 2. We recall that the weaker electronic transitions below 4.00 eV have a charge transfer (CT) nature (for S 1 and S 2 see ω CT charge transfer descriptor parameter in Table 2). Conversely, the very bright ones are characterized by electronic density reorganization occurring in the same molecular unit, hence they are of a local excitation (LE) character. For medoid 1, the TCNE is located on an edge of the 1ClN ring and it mainly contributes to the absorption bands above 2.50 eV (see light green curve in Figure 5), while for the S 0 -S 1 electronic transition at 1.807 eV characterized by a strong CT nature (ω CT = 0.968) the probability is negligible, f = 0.002. For medoids 2 and 4, the TCNE lies on the ring bearing the chlorine atom and they share roughly the same electronic properties in terms of transition probability and energy range (see orange and magenta curves in Figure 5, respectively). Both show absorption bands in all regions of the spectrum. In this case, the first two states S 1 and S 2 of both medoids contribute, respectively, to the bands at ∼2.00 and 2.80 eV. Also in these cases, the S 1 and S 2 states are characterized by a strong charge transfer nature as can be easily deduced from the values of the ω CT descriptor close to unity, reported in Table 2. In medoids 3 and 5, the TCNE is placed on the unfunctionalized six-membered ring of the 1ClN and we observe that the electronic properties show considerable differences. The electronic features of medoid 3 (violet curve in Figure 5) cover the entire spectral range considered, 1.50-5.00 eV, while only high energy electronic transitions (>3.50 eV) are bright for medoid 5 (dark green curve in Figure 5). Comparing the spectrum from the five medoids to that from the complete MD sampling, an excellent agreement is observed ( Figure 6, top panel). The experimental optical spectrum profile in solution ( Figure 6, bottom panel) shows two distinct absorption bands with maxima centered at 408 nm (3.04 eV) and 537 nm (2.31 eV), as well as the calculated spectrum. In particular, the first calculated band at ∼2.00 eV has contributions from the S 1 states of representative frames 4, 2 and 3, each having, in turn, a clear 1ClN → TCNE charge-transfer character. Analogously, the second band at ∼2.80 eV appears constituted by the S 2 CT states of medoids 1, 4, 3 and 2.  Such a case study proves the clustering technique to be an efficient way to estimate the electronic spectrum at finite temperature, avoiding excited state calculations on a large number of frames (for TCNE:π:1ClN model system, a 100-fold decrease in total computational cost). Moreover, the medoid excited state characterization ( Table 2) allows one to perform a more accurate spectral assignment of the absorption bands. This further confirms that the cluster medoids, taken as representative frames, can efficiently resume the collected conformational dynamics.

The N3 4− Case Study
The N3 4− dynamics at room temperature in water solution is characterized, on the one hand, by the rigidity of the dcbpy ligands, due to the chelation to the Ru center and, on the other hand, by the flexibility of the NCS − ligands, exploring conical-shaped regions (please see Figure 1, right panel, to recall the system under investigation). The vibrational dynamics induce therefore instantaneous deviations from the ideal C 2 symmetry, which could improve the transition probability of otherwise dark excited states [111]. The clustering procedure applied to the collected N3 4− trajectory suggested a partition into seven distinct clusters. Projection into the two-dimensional principal component subspace (actually accounting for 56.9% of the total variance) shows indeed a quite clear separation between the clusters and the medoids representing them (Figure 7). Again, the observed partial superposition could be a spurious effect of data visualization through a low-dimensional PCA. According to the feature values shown by the cluster medoids, these representative structures (reported in Figure 8) actually seem to capture both the conformational (torsional) freedom of the coordinated NCS − ligands (φ 1 and φ 2 torsional angles) and the different degrees of asymmetry sampled by the N3 4− dynamics (Table 3, please refer also to Section 3.2 for N3 4− features definitions). In particular, the values of continuous symmetry measure of deviation from C 2 symmetry (C 2 -CSM, Section 3.2) most sampled by the MD trajectory (distribution maxima at 0.09, 0.17, 0.22, 0.31, Figure 9) are close to the values by the cluster medoids, further confirming the representation capabilities of the latter.     Electronic absorption spectra of transition metal complexes are determined by several, closely spaced, excited states, differing in their spatial properties (i.e., metal and ligandlocalized transitions, metal-to-ligand (ML) and ligand-to-metal (LM) charge-transfer (CT) transitions). From a practical point of view, this implies the computation (and characterization) of a high number of excited states with some level of theory to simulate the spectrum in a given energy range. Therefore, a clustering analysis performed on an MD trajectory (and so reducing the complete configuration dataset to a few, representative, structures) potentially appears even more convenient for the simulation and the interpretation of transition metal complex electronic spectra including finite-temperature effects.
The N3 4− electronic spectrum was simulated up to ∼3.7 eV, comprising the two experimentally characterized bands at ∼2.50 and ∼3.36 eV [114]. The spectra calculated for each cluster medoid actually slightly differ in the absorption band positions (energies) and intensities, since the medoids represent different regions of the accessible conformational space ( Figure 10). In particular, the spectrum obtained from the only seven representative frames can actually quite well reproduce that from the complete MD sampling at T = 298 K in water solution (Figure 11), although with an increased sub-structure, due to the lower number of frames involved in the spectrum calculation. The selection of representative frames through a clustering analysis allowed one therefore to achieve a remarkable ∼70-fold decrease in the total computational cost for N3 4− electronic spectrum simulation.  Especially for transition metal complexes, the observed absorption bands can each be the result of many close transitions. Dataset reduction through a clustering analysis allowed an accurate N3 4− spectral characterization, which could be otherwise difficult to perform. In particular, the calculated band at 2.07 eV ( Figure 11) results from medoid 1 S 2 , medoid 2 S 2 , medoid 7 S 1 , medoid 6 S 1 and medoid 5 S 2 states, which are mainly Ru→(dcbpy) 2 (Ω RP ≈ 0.55, Ω SP ≈ 0.25, Table 4) CT states. Analogously, medoid 6 S 5 , medoid 7 S 5 , medoid 2 S 6 , medoid 1 S 5 and medoid 3 S 5 , with similar metal-to-ligand charge-transfer (MLCT) spatial features, contribute to the more intense calculated band at 2.39 eV. The higher-energy bands are characterized instead by a less homogeneous set of excited states. In fact, the calculated band at 3.23 eV results from medoid 2 S 8 , medoid 7 S 13 Ru→(dcbpy) 2 states (Ω RP ≈ 0.55, Ω SP ≈ 0.25, Table 4), medoid 2 S 18 Ru→(dcbpy) 2 state, but with an increased dcbpy localized-excitation character (Ω RP ≈ 0.40, Ω PP ≈ 0.30), medoid 4 S 15 Ru→(dcbpy) 2 state, with increased (NCS) 2 donor contribution (Ω RP ≈ 0.50, Ω SP ≈ 0.40) and medoid 7 S 19 state, which is mainly an (NCS) 2 →(dcbpy) 2 CT state (Ω SP ≈ 0.40, Ω RP ≈ 0.30). The close calculated 3.38 eV band has instead a quite different average character. In fact, the contributing medoid 1 S 40 state is mostly an (NCS) 2 →(dcbpy) 2 CT state (Ω SP ≈ 0.60), medoid 2 S 37 and medoid 5 S 33 states have an increased localized character (Ω SP ≈ 0.40, Ω PP ≈ 0.30), while medoid 3 S 21 and medoid 6 S 34 are localized excitations on dcbpy ligands (Ω PP ≈ 0.60 and ≈ 0.50, respectively).

Ab Initio Molecular Dynamics
The conformational flexibility of the TCNE:π:1ClN and N3 4− model systems were sampled through ab initio molecular dynamics simulations. In particular, the Atomcentered Density Matrix Propagation (ADMP) method was employed: the density matrix in an orthonormalized atomic basis is included in an extended Lagrangian as an additional degree of freedom and propagated together with the nuclear degrees, avoiding a selfconsistent procedure at each step [115][116][117][118][119].
The N3 4− system was simulated instead for 8.6 ps with a 0.1 fs time step [111]. A velocity rescaling every 1 ps allowed to keep a 298 K temperature. Explicit water solvation was included in the N3 4− ground state sampling, in order to better model the specific solute-solvent interactions at the several solvation sites. A 22 Å-radius spherical solvent box (∼1500 molecules) was extracted from a pre-equilibrated cubic one and placed around N3 4− . A hybrid quantum mechanics/molecular mechanics potential was employed: B3LYP/def2-SVP [135] for the QM portion (the N3 4− molecule) with associated electronic core potential for the Ru atom [136] and the TIP3P water model [137] for the MM part (the water spherical box), re-parametrized to allow a bending motion [11]. The QM and MM potentials were combined through the ONIOM QM/MM scheme [138][139][140], including the MM charges into the QM hamiltonian (i.e., an "electronic embedding"). General AMBER Force Field [141] atom types (and so van der Waals non-bonding parameters) were assigned, moreover, to N3 4− atoms. Non-periodic boundary conditions were introduced through a hybrid explicit/implicit solvent model. Long-range electrostatic effects and short-range dispersion-repulsion forces between the explicit and the bulk solvent were, respectively, modeled through C-PCM self-reaction field and an empirical confining potential, which has to be parametrized for the specific solvent model [10,124,[142][143][144]. We refer the reader to previous works for more details about the ab initio molecular dynamics simulations of the model systems and the employed potentials [34,94,111].

Feature Selection and Clustering of Molecular Dynamics Trajectories
Due to its large dimensions, it is often useful to transform the original dataset of the collected, N-frames long, trajectory (the configurations, i.e., the positions of each of the N at atoms in the system, at each time step) into a matrix X ∈ R N×d , representing the data in some d-dimensional (d 3N at ) feature space, different from the coordinate space. The chosen features should adequately describe the properties of interest, without much loss of information [77]. Internal coordinates, such as bonds, angles and dihedrals or more specifically tailored parameters, according to the problem under study, can be employed as features.
In particular, the TCNE:π:1ClN trajectory was transformed into a feature space able to describe the orientation of the two molecular planes and the relative position of the two molecules, comprising the angle ([0, 180 • ]) between the versors normal to the TCNE and ClN planes, the versor representing the axis of rotation of the two planes (i.e., the versor orthogonal to the former ones) and the relative position vector (i.e., the vector between the two geometric centers). For N3 4− , instead, a continuous symmetry measure of deviation from the ideal C 2 symmetry [145][146][147], calculated as the minimized root-mean-square deviation from the images generated through the C 2 symmetry operations, was considered.
In particular, C 2 -CSM was evaluated on the smallest subset of N3 4− atoms showing a symmetry not higher than C 2 , as the complete molecule. Since the non-linearity of the NCS − coordination in the water solution (C(NCS)-N(NCS)-Ru angle less than 180 • ) and their torsional mobility were previously recognized [111], the C(NCS)-N(NCS)-Ru-N(dcbpy) dihedrals describing the NCS − orientations were also included. In this regard, to avoid problems due to the periodicity around ±180 • , each dihedral φ was included as a (cos(φ), sin(φ)) pair to keep a metric feature space [148]. The MD datasets in the feature space X were standardized (i.e., shifted to zero mean and scaled to unit variance) before following analyses.
Clustering machine learning techniques allow one to partition a dataset, grouping similar instances according to a similarity measure, such as a metric (for instance, Euclidean) in the feature space [149]. Instances within a cluster should be similar to each other and different from those belonging to the other clusters. In K-Means [150] and K-Medoids [151][152][153] approaches, for a given number K of clusters, K cluster centers are obtained. The feature space is partitioned (tessellated) by assigning each instance to the closest center. The latter are found by minimization of a loss function, defined as the sum of the squared distances between each instance and the cluster center to which it is assigned: where x i belongs to the cluster k, c k is the corresponding center in the feature space and N is the number of "observations" (trajectory frames). While in the K-Means algorithm the cluster center is the mean of the cluster members and so does not have to correspond to any instance x i of the dataset, in the K-Medoids approach it is forced to be some x i , such that the sum of the squared distances from the cluster members is the lowest (like a median).
MD trajectories of TCNE:π:1ClN and N3 4− model systems in their respective feature spaces were clusterized with the K-Medoids algorithm [152], since the cluster medoids, which are representative of the corresponding clusters, are trajectory frames themselves and, compared to K-Means centroids, should be less sensitive to possible outliers [151].
The optimal number of clusters K was chosen searching for an "elbow" (i.e., a slope change) in the plot of the inertia parameter (i.e., the minimized value of Equation (1)) as a function of K and evaluating the Calinski-Harabasz index [154], which is the ratio of between-cluster and within-cluster dispersions, being higher for a better clusterization into compact and separated clusters.

Dimensionality Reduction for MD Data Visualization
Principal component analysis (PCA) [155] is a popular dimensionality reduction technique. For some centered (i.e., zero-mean) data matrix X, its principal components v j are the eigenvectors of its covariance matrix C: It can be shown that v 1 (the eigenvector corresponding to the largest eigenvalue λ 1 ) is the direction along which the variance of the data is highest, v 2 is the direction of highest variance in the subspace orthogonal to v 1 , etc., while the eigenvalues λ j are the variance of the data along each v j . Projection of the data on the subspace of the first d r ≤ d principal components can be performed via the following: where ∑ d r j λ j is the variance retained in the PC subspace. PCA dimensionality reduction (d r = 2) of TCNE:π:1ClN and N3 4− trajectories was performed only for data visualization purposes on two-dimensional plots and not as a pre-processing step for clustering analysis. In fact, the dimensionalities of their respective feature spaces (Section 3.2) are actually quite small, likely not involving any "curse of dimensionality" issues.

Excited State Characterization and Spectra Simulations
TCNE:π:1ClN and N3 4− excited states were computed with the linear-response TD-DFT approach at CAM-B3LYP/GD3/C-PCM(DCM)/6-31+G(d,p) and B3LYP/C-PCM(water)/def2-SVP/SDD(Ru) levels of theory, respectively. Electronic spectra in the solution at T = 298 K were simulated on 500 frame subsets of the collected MD samplings (i.e., every 20 fs and 17.2 fs, respectively). The first 8 and 40 singlet excited states were calculated for TCNE:π:1ClN and N3 4− , respectively. The complete spectra were obtained by summation of Gaussian-shaped contributions over each frame and each calculated excited state: where f li and ν li are the oscillator strength and excitation energy of the i-th state of l-th frame and σ is a width parameter, set at σ 2 = 0.001 eV 2 . Spectra estimated from the only cluster medoids were similarly calculated: where K is the number of clusters and p k is the k-th cluster population. Cluster medoid excited states were further characterized by fragment-based transition density Löwdin population analysis and related charge transfer descriptors, calculated with the TheoDORE package [156,157]: where A and B are two molecular fragments and D 0i is the transition density matrix for the S i ← S 0 excitation. Ab initio molecular dynamics simulations and excited state calculations were performed with the Gaussian16 software package [158].

Conclusions
Unsupervised clustering methods have been employed as complementary approaches for an efficient exploration of the data resulting from MD simulations and electronic structure calculations. In this work, MD dataset reduction capabilities via unsupervised clustering techniques were applied for the ab initio modeling of electronic absorption spectra of the non-covalent charge-transfer TCNE:π:1ClN dimer and the [Ru(dcbpy) 2 (NCS) 2 ] 4− complex in solution at room temperature. Cluster medoids, taken as representative structures, were found and analyzed in terms of main structural parameters, principal component dynamics, electronic excitations and charge transfer indices, showing how such medoids can satisfactorily cover the system dynamics and optical properties with a very good agreement with experiments.
The simulation of electronic absorption spectra usually demands expensive computations and requires dealing with the interplay of electronic excited states with the conformational freedom of the chromophores in complex matrices (i.e., solvents, biomolecules, crystals) at finite temperature. This work highlights the power of the unsupervised K-medoid clustering technique combined with a tailored selection of the feature space in reducing by ∼100 times the total cost of electronic and optical property computations on an MD sampling with no loss of accuracy and in preserving the molecular interpretation via the cluster medoids. In this regard, it could be very interesting to study how the medoids and the several conformational minima are related and this is a subject for further spectroscopic and weighting scheme developments.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: 1ClN 1-chloronaphthalene ADMP Atom-centered density matrix propagation C-PCM