Open Access
This article is
 freely available
 reusable
Algorithms 2020, 13(2), 30; https://doi.org/10.3390/a13020030
Article
Learning Manifolds from Dynamic Process Data †
^{1}
Department of Computer Science & Engineering, University at Buffalo, Buffalo, NY 14260, USA
^{2}
Department of Computational and DataEnabled Science & Engineering, University at Buffalo, Buffalo, NY 14260, USA
^{3}
Department of Materials Design & Innovation, University at Buffalo, Buffalo, NY 14260, USA
^{4}
Department of Biomedical Informatics, University at Buffalo, Buffalo, NY 14260, USA
*
Correspondence: [email protected]
^{†}
This paper is an extended version of our paper published in Proceedings of the 2018 IEEE International Conference on Big Data, Seattle, WA, USA, 10–13 December 2018.
Received: 30 September 2019 / Accepted: 14 January 2020 / Published: 21 January 2020
Abstract
:Scientific data, generated by computational models or from experiments, are typically results of nonlinear interactions among several latent processes. Such datasets are typically highdimensional and exhibit strong temporal correlations. Better understanding of the underlying processes requires mapping such data to a lowdimensional manifold where the dynamics of the latent processes are evident. While nonlinear spectral dimensionality reduction methods, e.g., Isomap, and their scalable variants, are conceptually fit candidates for obtaining such a mapping, the presence of the strong temporal correlation in the data can significantly impact these methods. In this paper, we first show why such methods fail when dealing with dynamic process data. A novel method, EntropyIsomap, is proposed to handle this shortcoming. We demonstrate the effectiveness of the proposed method in the context of understanding the fabrication process of organic materials. The resulting lowdimensional representation correctly characterizes the process control variables and allows for informative visualization of the material morphology evolution.
Keywords:
manifold learning; time series; dynamic processes1. Introduction
Scientific data, either produced by complex numerical simulations or collected by highresolution scientific instruments, are typically characterized by three salient features: (i) massive data volumes; (ii) high dimensionality; and (iii) the presence of strong temporal correlation in the data. Representation of this big process data in a lowdimensional (2D or 3D) space can reveal key insights into the dynamics of the underlying scientific processes at play. Here, the term process data means any data that represent the evolution of some process states over time (see Figure 1). Indeed, most of these highdimensional datasets are generated through an interplay of a few physical processes. However, such interactions are typically nonlinear, which means that linear dimensionality reduction methods, such as principal component analysis (PCA), are not applicable here. Instead, one needs to resort to nonlinear methods which assume that the nonlinear processes can be characterized by lowdimensional submanifolds, and by “mapping” the data onto such manifolds, one can understand the true behavior of the physical processes in a lowdimensional representation.
Our focus in this work was to develop a novel method for dimensionality reduction of process data, which can handle the above listed challenges. Input data is typically large, as each sample of a process delivers a time series of highdimensional points, which rules out many nonlinear dimensionality reduction methods. At the same time, the presence of strong temporal correlation among data points that belong to the same time series confuses many dimensionality reduction methods which operate under the assumption that the input data is uniformly sampled in the manifold space.
The key motivation behind this work arises from the massive and highdimensional datasets that are produced by computational models that simulate material morphology evolution during the fabrication process of organic thin films (see Section 5.1), by solving nonlinear partial differential equations. Organic thin film fabrication is a key factor that controls the properties of organic electronics, such as transistors, batteries, and displays. However, this requires expensive search campaigns of the design space. Choice of the fabrication parameters impacts the trajectory that the process follows, which eventually determines the properties of the material being fabricated. Scientists and engineers are interested in using dimensionality reduction on the resulting big data to explore the material design space, and optimize the fabrication to make devices with desired properties.
In our prior work, we proposed SIsomap, a spectral dimensionality reduction technique for nonlinear data [1] that can scale to massive data streams. While this method can efficiently and reliably process highthroughput data streams, it assumes that the input data are weakly correlated. Consequently, it fails when applied directly to process data. SIsomap was derived from the standard Isomap algorithm [2], which is frequently used and favored in scientific computing data analysis [3,4,5,6,7,8]. Unfortunately, while there is some prior work on applying Isomap to spatiotemporal data [9], the focus has been on segmentation of data trajectories rather than discovering a continuous latent state. In our recent work, we proposed a new spectral method to handle highdimensional process data [10].
This paper significantly expands on our past work [10] and makes two key contributions. The first contribution is to show how the standard linear and nonlinear dimensionality reduction methods, e.g., PCA, Isomap, etc., fail when dealing with process data. The poor performance can be attributed to the fact that every observation is highly correlated with the temporally close observations belonging to the same trajectory, and there is a lack of mixing (or crosstalk) between different trajectories of a process. The second contribution of this paper is a new method, EntropyIsomap, which induces the crosstalk between different trajectories by adaptively modifying the neighborhood size for every data instance. The proposed method is both easy to implement and effective, and could likely be applied to other spectral dimensionality reduction methods, besides Isomap.
2. Background and Related Work
In this section we provide a gentle introduction to the spectral dimensionality reduction methods and the process data encountered in our target application, and also discuss some related techniques discussed in the literature for these topics.
2.1. Spectral Dimensionality Reduction Methods
Spectral dimensionality reduction (SDR) refers to a family of methods that map highdimensional data to a lowdimensional representation by learning the lowdimensional structure in the original data. SDR methods rely on the assumption that there exists a function $f:{\mathbb{R}}^{d}\to {\mathbb{R}}^{D}$, $d\le D$, that maps lowdimensional coordinates, ${y}_{i}\in {\mathbb{R}}^{d}$, of each data sample to the observed ${x}_{i}\in {\mathbb{R}}^{D}$. The goal then becomes to learn the inverse mapping, ${f}^{1}$, that can be used to map highdimensional ${x}_{i}$ to lowdimensional ${y}_{i}$. While different methods exist within this family, they all share a common computing pattern. For a given set of points, $\mathbf{X}$, in a highdimensional space ${\mathbb{R}}^{D}$, SDR methods compute either top or bottom eigenvectors of a feature matrix, $\mathbf{F}$ derived from the $n\times D$ data matrix, $\mathbf{X}$, where n is the number of data instances. Here, the feature matrix, $\mathbf{F}$, captures the structure of the data through some selected property (e.g., pairwise distances).
Two broad categories of SDR methods exist, based on the assumption they make about f (i.e., linear vs. nonlinear). Linear methods assume that the data lie on a lowdimensional subspace ${\mathbb{V}}^{d}$ of ${\mathbb{R}}^{D}$, and construct a set of basis vectors representing the mapping. When working with the linearity assumption, the most commonly used methods are PCA and multidimensional scaling (MDS) [11]. PCA learns the subspace that best preserves covariance, i.e., F is the $D\times D$ sample covariance matrix of the input data. The basis vectors learned by PCA, known as principal components, are the directions along which the data has highest variance. In the case of MDS, the feature matrix is the $n\times n$ dissimilarity matrix that encodes some pairwise relationships between data points in $\mathbf{X}$. When these relationships are Euclidean distances, the result is equivalent to that of PCA, and this is known as classical MDS. Spectral decomposition of F in both methods yields eigenvectors Q. Taking the top d eigenvectors, the data can be mapped to lowdimensions as $\mathbf{Y}$ by the transformation Y = X${\mathbf{Q}}_{d}$.
However, in cases where the data is assumed to be generated by some nonlinear process, the linearity assumption is too restrictive. In such cases, both PCA and MDS are not robust enough to learn the inverse mapping ${f}^{1}$. Although variants of PCA have been proposed to address such situations (e.g., kernel PCA [12]), their performance is sensitive to the choice of the kernel and the associated parameters. Instead, the most common approach is to use a manifold learningbased nonlinear SDR (or NLSDR) technique, such as Isomap [2].
NLSDR techniques can be divided into two categories: global and local. Global methods, e.g., Isomap, minimum volume embedding [13], preserve a global property of the data. On the other hand, local methods, e.g., local linear embedding (LLE) [14,15], diffusion maps [16], Laplacian eigenmaps [17], preserve a local property for each data instance. All of these methods, however, involve a series of similar data transformations. First, a neighborhood graph is constructed, in which each node is linked with its k nearest neighbors. Next, this neighborhood graph is used to create a feature matrix which characterizes the property that the underlying algorithm is trying to preserve. For Isomap, the feature matrix is obtained as the shortest path graph in which every pair of data instances is connected to each other by the shortest path between them. The lowdimensional representation of the input data is obtained by factorization of the feature matrix. Typically, first d eigenvectors/values form the output Y. The steps of the Isomap algorithm are outlined in Algorithm 1.
Algorithm 1 ISOMAP 
Input:$\mathbf{X}$, k 
Output:$\mathbf{Y}$ 

2.2. Dynamic Process Data
If the rows in the input data matrix, $\mathbf{X}$, are uniformly sampled from the underlying manifold, methods such as Isomap are generally able to learn the manifold, in the presence of sufficient data. However, here, we consider the scenario where $\mathbf{X}$ represents a dynamic process, i.e., instances in $\mathbf{X}$ are partitioned into T trajectories, ${\Gamma}_{1},{\Gamma}_{2},\dots ,{\Gamma}_{T}$. ${\Gamma}_{I}$ represents one trajectory which is specified by a $\tau $parameterized sequence of ${m}_{I}$ data points. Thus, the sum of the lengths of the T trajectories is equal to n, i.e., ${\sum}_{T}{m}_{I}=n$. In other words, ${\Gamma}_{I}$ = (${x}_{I}\left({\tau}_{1}\right)$, ${x}_{I}\left({\tau}_{2}\right)$,…, ${x}_{I}\left({\tau}_{{m}_{I}}\right)$), where ${\tau}_{i}<{\tau}_{j}$ when $i<j$. Parameter $\tau $ usually denotes time, and trajectory ${\Gamma}_{I}$ can be a function of one or more additional parameters.
The target application for this study is the study of material morphology evolution during the fabrication of organic thin film [18]. In particular, we use process data produced by the numerical simulation of the morphology evolution. Each input trajectory (denoted as $\Gamma (\varphi ,\chi )$, from here on) corresponds to a simulation which is parameterized by two fabrication process variables, viz.: (i) $\varphi $, or the blend ratio of polymers making organic film; and (ii) $\chi $, or the strength of interaction between these polymers. Each trajectory consists of a series of “images”, generated over time, where each image is actually a 2D morphology snapshot that is produced by the numerical simulation for a given time step. A simple preprocessing step transforms the image to a highdimensional vector, ${\mathbb{R}}^{D}$. See Figure 1 for some examples of the images that are generated for a selected combination of $\varphi $ and $\chi $. A detailed description of the data and the data generation process is provided in Section 5.
Ideally, to understand the impact of the input parameters, $\varphi $ and $\chi $, on the final properties of the material, a large number of simulations need to be run that can uniformly span the domains of $\varphi $ and $\chi $. However, the computational cost associated with generating one trajectory, corresponding to a single combination of $\varphi $ and $\chi $, means that the sampling is sparse. On the other hand, sampling in the time dimension is relatively dense. Instances belonging to the same trajectory tend to be strongly correlated, which is reflective of how the morphologies evolve. These factors strongly influence the connectivity of the neighborhood graph, G, which strongly impacts the approximation of the manifold distances by the Isomap algorithm.
2.3. Related Works in Dealing with Dynamic Processes
A significant body of work exists in the field of developing reduction strategies for highdimensional data generated via complex physical processes. In fact, linear projection methods are often used to construct such reduced order models from highdimensional data [19]. Nonlinear counterparts for such settings have been explored in limited settings, for certain types of dynamical systems [20,21,22,23]. In particular, several of these methods have adapted diffusion maps [16] to dynamical process data, by considering a fixed length “timewindow” around each sample of the process data to define a local neighborhood, which can account for moderate noise at short temporal scales. However, most of these methods focus on understanding a single dynamical process, and not necessarily a collection of trajectories obtained by varying the parameter settings.
One recent work proposes a datadriven method for organizing temporal observations of dynamical systems that depend on the system parameters [24]. In that work, the authors simultaneously compute lower dimensional representations of the data along multiple dimensions, e.g., time, parameter space, initial variable space, etc. An iterative procedure is then applied, wherein every step involves mapping the data into a lowdimensional manifold using the diffusion maps algorithm for each dimension, then recomputing the distances between the observations (for each dimension) through a reconciliation step that utilizes information from other dimensions, and then repeating the process. While, one could adapt the above method for the problem discussed here, it is not designed to provide insights about the sparsely populated regions in the manifold, which is crucial to guide the next round of simulations in our target application.
3. Challenges of Using SDR with Dynamic Process Data
As mentioned earlier, PCA and Isomap are two standard offtheshelf approaches to perform dimensionality reduction on highdimensional data. However, if the method is applied without taking into consideration the underlying assumption of data linearity and uncorrelated samples, it delivers highly misleading results. Here, we study the effectiveness of both PCA and Isomap when dealing with dynamic process data.
3.1. Comparing PCA and Isomap for Dynamic Process Data
A reliable way of determining the quality of the lowdimensional representation (mapping) produced by each method is to compare the original data $\mathbf{X}$ in ${\mathbb{R}}^{D}$ with the mapped data $\mathbf{Y}$ in ${\mathbb{R}}^{d}$, by computing the residual variance. The process of computing residual variance for PCA differs from Isomap, but the values are directly comparable.
In PCA, each principal component (PC) explains a fraction of the total variance in the dataset. If we consider ${\lambda}_{i}$ as the eigenvalue corresponding to the ith PC and $\Lambda $ as the total energy in the spectrum, i.e., $\Lambda ={\sum}_{i=1}^{D}{\lambda}_{i}$, then the variance explained by the ith PC can be computed as $\frac{{\lambda}_{i}}{\Lambda }$. The residual variance can be calculated as
$$R=1\sum _{i=1}^{d}\frac{{\lambda}_{i}}{\Lambda }.$$
In the Isomap setting, residual variance is computed by comparing the approximate pairwise geodesic distances, computed in G represented by matrix ${\mathbf{D}}_{G}$ (recall that G is a neighborhood graph), to the pairwise distances of the mapped data $\mathbf{Y}$, represented by matrix ${\mathbf{D}}_{Y}$:
$$R=1\rho {({\mathbf{D}}_{G},{\mathbf{D}}_{Y})}^{2}.$$
Here, $\rho $ is the standard linear correlation coefficient, taken over all entries of ${\mathbf{D}}_{G}$ and ${\mathbf{D}}_{Y}$.
To compare PCA and Isomap, we compare the residual variance obtained using PCA and Isomap on the earlier described material morphology evolution process data. The dataset consisted of six trajectories, corresponding to a unique combination of the parameters $\varphi $ and $\chi $. The comparison is shown in Figure 2. It is evident that PCA is not able to learn a reasonable lowdimensional mapping even when using 10 eigenvectors, while Isomap produces a highly accurate representation. For instance, while Isomap is able to explain about 70% of the variance using three dimensions, PCA requires more than nine dimensions. It should be noted that the ability to represent data in two or three dimensions is especially desired by domain experts, as it allows for data visualization and exploratory analysis.
If we visualize the data in 3D space using the PCA representation ($d=3$) (see Figure 3a), we observe that while PCA is able to describe the time aspect of the process evolution, it does not offer any additional insights into the process, making it ineffective for the task at hand. This can be primarily attributed to PCA’s inability to capture the nonlinear relationships between the high and lowdimensional data.
The 3D visualization of Isomap is shown in Figure 3b. We note that despite the fact that Isomap is better than PCA at minimizing the residual variance, the lowdimensional trajectories do not offer meaningful insights. The trajectories appear to diverge from one another, leaving no reasonable interpretation of the empty space, whereas one would expect some ordering with respect to the $\varphi $ parameter. This indicates that the standard application of Isomap is inadequate when working with parameterized highdimensional time series data. We obtain equally unsatisfactory results with other methods, including tSNE and LLE [14,25].
3.2. Standard Isomap and Dynamic Process Data
While the Isomap trajectories in Figure 3b diverge from the initial time point, close inspection reveals that the trajectories exhibit nondivergent behavior in the earlier steps, which is expected since the morphologies are expected to evolve in a similar fashion at the beginning of the simulation.
We only applied Isomap to the early stage data represented by the first 30 time steps of each trajectory (the threshold was selected by the domain expert). The results are shown in Figure 3c. The results here are more encouraging as one can clearly observe that data points for all trajectories cluster together before quickly diverging. This observation points us to the key deficiency of Isomap when dealing with dynamic process data. When data instances exhibit strong temporal correlations, the neighborhood computation for any data instance in Isomap is heavily dominated by other instances belonging to the same trajectory. Thus, Isomap cannot capture the relationships across different trajectories and the reduced representation is dominated by the time dimension, as can be seen in Figure 3b.
To further study this point, we consider the pairwise distance matrix $\mathbf{D}$ that contains the distance between every pair of data instances. The rows and columns of the matrix are ordered by the trajectory index and time, as shown in Figure 4a. Consider any pair of row (and column) indices, i and j, such that the trajectory index that contains the ith data instance is I, and the trajectory index that contains the jth data instance is J and ${\tau}_{i}$, and ${\tau}_{j}$ denotes the time index for each instance within the corresponding trajectory. Then, row i precedes row j in the matrix $\mathbf{D}$ if either $I<J$ or ${\tau}_{i}\le {\tau}_{j}$ if $I=J$.
We consider another visualization of this matrix (See Figure 4b), where the row ordering is retained, but each row contains the sorted distances (increasing order) of the corresponding data instance and all other instances in the dataset. We can observe that for the majority of instances, the first several nearest neighbors are always from the same trajectory. This is problematic because the ability of Isomap to learn an accurate description of the underlying manifold depends on how well the neighborhood matrix captures the relationship across the trajectories. We refer to this relationship as crosstalk, or mixing among the trajectories. For any given point, the desired effect would be that the nearest neighborhood set contains points from multiple trajectories. However, the sorted neighborhood matrix indicates a lack of mixing, which essentially means that the Isomap algorithm does not consider information from other trajectories when learning the shape of the manifold in the neighborhood of one trajectory.
3.3. Quantifying Trajectory Mixing
We propose a quantitative measure of the quality of neighborhoods in terms of the trajectory mixing by employing the informationtheoretic notion of entropy. Consider any data instance, x, which belongs to trajectory ${\Gamma}_{i}$. Let ${p}_{j}$ be the fraction of k nearest neighbors of x that lie on a trajectory ${\Gamma}_{j}$. The entropy of the kneighborhood of the data instance x is calculated as
$${H}_{x}^{k}=\sum _{\forall j,{p}_{j}\ne 0}{p}_{j}{log}_{2}{p}_{j}.$$
Similarly, we can define the kneighborhood entropy for a trajectory ${\Gamma}_{i}$ as the average of kneighborhood entropy for all data instances on the ith trajectory, i.e.,
$${H}_{{\Gamma}_{i}}^{k}=\frac{1}{{m}_{i}}\sum _{x\in {\Gamma}_{i}}{H}_{x}^{k}.$$
Neighborhood entropy is directly related to the mixing of the trajectories in the proximity of a given data instance. If the neighborhood entropy of an instance is high, it means that the nearest neighbors of that instance exist in a large number of trajectories, indicating a strong mixing of the trajectories. On the other hand, if the entropy of a data instance is low, it means that the nearest neighbors lies on a single or very few trajectories, indicating a low level of mixing.
3.4. Strategies for Inducing Trajectory Mixing
One simple way to induce more trajectory mixing is to increase k, since this would increase the neighborhood entropy of the points. In considering the average neighborhood entropy for six individual trajectories for various values of k, we find that the neighborhood entropy increases linearly with k. Thus, for a small value of k, Isomap is unable to obtain a meaningful lowdimensional representation, as evident when using standard Isomap, as discussed above, where k was set to 8. In contrast, using large k could result in the desired level of trajectory mixing. However, as discussed in the Isomap paper [2], the approximation error between the true geodesic distance on the manifold between a pair of points and the approximate distance calculated using Dijkstra’s algorithm is inversely related to k. For large k, Isomap is essentially reduced to PCA, and is unable to capture the nonlinearities in the underlying manifold.
Another strategy to induce trajectory mixing is subsampling, i.e., selecting a subset of points from a given trajectory. However, this would result in reduction of the data, which yields poor results. Alternatively, we could use skipping in the neighborhood selection, i.e., for a given point, skip the s nearest points before including points in the neighborhood. Unfortunately, in experimenting with skipping and subsampling approaches we experienced a loss in local manifold quality or data size, respectively. On the basis of this and the desire to achieve the most accurate local and global qualities, we propose an entropydriven approach in the next section.
4. EntropyIsomap
As we established earlier, standard Isomap does not work well for dynamic process data since neighboring data points are typically from the same trajectory. However, the global structure of the process manifold is determined by the relations between different trajectories. When kNN neighborhoods are computed, this can result in poor mixing of the trajectories. The mixing can also vary depending on the temporal location of the process. For example, if the trajectories are generated using similar initial conditions, the trajectories will exhibit strong mixing. However, the trajectories typically diverge subsequently. A neighborhood size k that produces good results in early stages might produce poor results later on in the process. A value of k that is large enough to work for all times might include so many data points that the geodesic and Euclidean distances become essentially the same, which results in a PCAlike behavior, defeating the purpose of using Isomap.
To address this situation, we propose to directly measure the amount of mixing and use it to change the neighborhood size for different data points adaptively. This mitigates the shortcomings of the two methods described in the previous section, which either discard data (subsampling) or lose local information (skipping).
Figure 5 shows that neighborhood entropy increases when the next nearest neighbors are added. We propose using a threshold on the neighborhood entropy, as defined above, to adaptively determine an appropriate neighborhood size, k. This modification allows the flexibility of larger neighborhoods in regions where it is necessary or desired to force mixing between trajectories.
An additional parameter, M, determines the largest possible neighborhood size. This userdefined parameter is used to ensure that the neighborhoods are not so large as to reduce Isomap to PCA. For datasets which contain trajectories in poorly sampled regions of the state space, M controls the size of the local neighborhood, without skewing the rest of the analysis, which would otherwise result in an unreasonably large value of k.
Algorithm 2 outlines the step of the proposed EntropyIsomap algorithm. While the proposed algorithm has some similarities to the standard Isomap algorithm (See Algorithm 1), there are some key differences. First, the algorithm takes two additional arguments, the target entropy level, $\widehat{H}$ to determine the optimal neighborhood size, and the maximum neighborhood size, M. The initial step, computing all pairwise distances for data instances in $\mathbf{X}$, remains the same as in the standard algorithm. Next, the algorithm performs the entropybased neighborhood selection (lines 3–9). For each point ${x}_{i}$, the algorithm starts with an initial neighborhood size, k, and identifies the knearest neighbors. These are used to compute the neighborhood entropy for the data instance. If the entropy threshold $\widehat{H}$ is not satisfied, then ${k}_{i}$ is incremented (line 6), and the process repeats. Once the entropy threshold is reached, or if ${k}_{i}$ reaches the maximum threshold of M, the process terminates. The entire process is repeated for each ${x}_{i}$, and after all neighborhoods have been identified, the algorithm continues the same way as standard Isomap (lines 10–12). While this iterative strategy for identifying optimal neighborhood size for each data instance appears to be inefficient, it is presented here for more clarity. In practice, the optimal neighborhood determination can be made via a more efficient binary searchbased strategy.
Algorithm 2 ENTROPYISOMAP 
Input:$\mathbf{X}$, k, $\widehat{H}$, M 
Output:$\mathbf{Y}$ 

We applied EntropyIsomap to the process data discussed earlier, with $k=8$ and the maximum number of steps $M=100$. The large M ensures that the algorithm is able to create very large neighborhoods to strictly enforce trajectory mixing. To understand the behavior of the method as a function of the entropy threshold, $\widehat{H}$, we varied $\widehat{H}$ from $0.1$ to $0.9$. The example lowdimensional representation obtained by EntropyIsomap is presented in Figure 6b.
Figure 7b shows how the final neighborhood entropy for the instances belonging to six trajectories vary according to time. We note that, for this data, the threshold for entropy ($\widehat{H}$, set to 0.3 for this experiment) is often not reached even after expanding ${k}_{i}$. Instead, the neighborhood size reaches the maximum limit, M. We believe that this is because the k nearest neighbors for the majority of points are in the same trajectory, (see Figure 4b), which leads to skewed neighborhood distributions. As a result, even when a satisfactory number of neighbors come from other trajectories, the entropy for the neighborhood might be low. Even when the trajectories mix, the neighborhoods are still dominated by other instances in the same trajectory, as shown in Figure 6c. Therefore, while high entropy implies good mixing, the converse is not necessarily true. Large neighborhoods could produce mixing, while still having low entropy. Thus, the entropy threshold is set to be around 0.30–0.40, with further confirmation through experimentation being presented in the next section.
If the entropy threshold is strictly enforced, one will have to increase the neighborhood size. For the process data, we observed that the neighborhood sizes will become very large. Figure 7a shows the neighborhood size distribution for $\widehat{H}=0.30$. For such values of k, the method simply reduces to PCA, a linear method, which suffers from limited expressiveness because of the linearity assumption.
Points that produce no mixing also end up with large neighborhoods, as EntropyIsomap tries to increase ${k}_{i}$ in order to meet the entropy threshold. These points occur when the dataset does not contain enough trajectories that pass near those particular states to produce good geodesic distance estimates. Interestingly, we found that plotting entropy versus time in Figure 7b reveals that trajectories can pass through poorly sampled parts of the state space and again “meet up” with other trajectories.
The proposed methods can be used to detect trajectories that do not interact and also which regions of the state space are poorly sampled. This can be used to either remove them from the dataset or as a guide to decide where to collect more process data.
5. Application
The current work is motivated by the need to analyze and understand big datasets arising in the manufacturing of organic electronics (OEs). OE is a new sustainable class of device, spanning organic transistors [26,27], organic solar cells [28,29], diode lighting [30,31], flexible displays [32], integrated smart systems such as RFIDs (Radiofrequency Identification) [33,34], smart textiles [35], artificial skin [36], and implantable medical devices and sensors [37,38]. The critical and highly desirable features of OEs are their cost, and their rapid and lowtemperature rolltoroll fabrication. However, many promising OE technologies are bottlenecked at the manufacturing stage, or more precisely, at the stage of efficiently choosing fabrication pathways that would lead to the desired material morphologies, and hence device properties.
The final properties of OEs (e.g., electrical conductivity) are a function of more than a dozen material and process variables that can be tuned (e.g., through evaporation rate, blend ratio of polymers, final film thickness, solubility, degree of polymerization, atmosphere, shearing stress, chemical strength, and frequency of patterning substrate), leading to the combinatorial explosion of manufacturing variants. Because the standard trialanderror approach, in which many prototypes are manufactured and tested, is too slow and cost inefficient, scientists are investigating in silico approaches [39,40]. The idea is to describe the key physical processes via a set of differential equations, and then perform highfidelity numerical simulations to capture the process dynamics in relation to input variables. Then, the problem becomes to identify and simulate some initial set of manufacturing variants, and use analytics of the resulting process data to first understand the process dynamics (e.g., rate of change in domain size or the transition between different morphological classes), and then identify new promising manufacturing variants.
The key scientific breakthroughs that improved organic solar cell (OSC) performance were closely related to the manufacturing pathway. Most advances have been achieved by nontrivial and nonintuitive (and sometimes very minor) changes in the fabrication protocol. Classic examples include changing the solvent [41], and thermal annealing [42], which together resulted in a two orders of magnitude increase in the efficiency of OSCs. These advances have reaffirmed the importance of exploring processing conditions to impact device properties, and have resulted in the proliferation of manufacturing variants. However, these variants are invariably chosen using trialanderror approaches, which has resulted in the exploration of very scattered and narrow zones of the space of potential processing pathways due to resource and time constraints.
5.1. Data Generation
The material morphology data analyzed in this paper was generated by a computational model based on the phasefield method of recording the morphology evolution during thermal annealing of the organic thin films [18,43]. We focused on the exploration of two manufacturing parameters: blend ratio $\varphi $ and strength of interaction $\chi $. We selected these two parameters, since they are known to strongly influence the properties of the resulting morphologies. For each fabrication variant ($\varphi ,\chi $), we generated a series of morphologies that together formed one trajectory $\Gamma (\varphi ,\chi )$.
We selected the range of our design parameters $\varphi =[0.5,0.6]$ and $\chi =[2.2,3.0]$ to explore several factors. First, we were interested in two stages of the process: early materials phase separation and coarsening. Moreover, we wanted to explore various topological classes of morphologies. In particular, we were interested in identifying the fabrication conditions leading to interpenetrated structures. Finally, we hoped to find the optimal annealing time that results in desired material domain sizes. In total, we generated 16 trajectories, with, on average, 180 morphologies per trajectory. Each morphology was represented as an image converted into a $40,000$ dimensional space defined by pixel composition values. The entire data used in our experiments as well as the source code of the method are open and available from https://gitlab.com/SCoReGroup/EntropyIsomap.
5.2. Results
For the target application of optimal manufacturing design, we leveraged dimensionality reduction to gain insights into the morphological data. First, we aimed to discover the latent variables governing the associated dynamic process. Second, we aimed to unravel the topology of the manifold in order to explore the input parameter space and ultimately identify the manufacturing variant that leads to the desired morphology.
Using EntropyIsomap, we performed the analysis for a set of morphological pathways. In Figure 8 and Figure 9, we depict the discovered threedimensional manifold for the set of 16 pathways. In the discovered manifold, the individual pathways are ordered according to process variables that were varied to generate the data. In each figure, the same manifold is depicted eight times. For easier inspection, each variant is individually color coded. The panels in the top row of Figure 8 depict the pathways for fixed $\varphi $ and varying $\chi $. We observe that the pathways for increasing $\varphi $ are ordered from front to back. The panels in the bottom row of Figure 8 highlight the pathways for increasing $\chi $. Similarly to the top row, we observe the pathways to be ordered from right to left.
Figure 8 depicts the manifold for the early stages of morphology evolution, while Figure 9 depicts the manifold for a longer evolution time. However, in both cases, the observed ordering is consistent. The pathways for increasing $\chi $ are ordered from the right (dark) to the left (light), while the pathways for increasing $\varphi $ are ordered from the front (green) to back (blue).
The observed ordering of pathways strongly indicates that the input variables are also the latent variables controlling the dynamics of the process. The discovered manifold also reveals that denser sampling is required along the blend ratio space variable ($\varphi $). Specifically, the pathways sharing the same $\chi $ but varying $\varphi $ are spread further apart on the manifold than those sharing the same $\chi $ value. This observation has important implications for the exploration of the design space. In particular, adding more sampling points in the $\varphi $ space offers higher exploration benefits, while adding more points in the $\chi $ space improves exploitation chance. This indicates that the $\varphi $ space should be explored first, followed by a potential exploitation phase.
Finally, using EntropyIsomap, we identified two regimes in the manifold. Morphologies in the early stages of the process are mapped to evolve in the radial direction, while morphologies in late stages are mapped to evolve parallel to each other. This is interesting as the underlying process indeed has two inherent time scales. In the early stage, the fast and dynamic phase separation between the two polymers occurs. During this stage, the composition of the individual phases changes significantly. These changes mostly increase composition amplitude. In the second stage, the equilibrium composition is already established. The coarsening between already formed domains dominates the dynamics of the process. Here, the amplitude of the composition (signal) does not change significantly. The changes mostly occur in the frequency space of the domain size, with the domain sizes increasing over time.
6. Conclusions and Future Directions
Existing spectral dimensionality reduction methods do not work well when the underlying data exhibits a temporal correlation, as is the case with dynamic process data. We propose using the notion of neighborhood entropy to quantitatively determine the amount of information exchanged among data instances, independent of the temporal component. On the basis of this measure, we propose the EntropyIsomap algorithm, which uses the entropy of the neighborhood to adaptively increase the neighborhood size, and thus facilitate crosstalk across different process trajectories.
We show how the proposed methodology can be used to better understand the morphology evolution of two immiscible materials. In future, the proposed approach can be leveraged for the smart and autonomous exploration of the manufacturing design spaces in organic electronics. The notion of mixing between individual trajectories can be used to detect the underexplored parts in the design, thus helping to build more reliable manifolds and increase confidence in the understanding of the dynamic process. The reduced order representation learnt by the proposed method shows a clear ordering of the trajectories according to the process variables. At the same time, the learnt representation reveals sparsely populated regions in the manifold, motivating the need for more samples in those regions. Thus, this insight can provide guidance in terms of designing the next round of simulations that can generate data corresponding to the undersampled process configurations. More importantly, the exposed undersampled regions of the design space can be used for active learning of the dynamic processes, potentially reducing the number of required numerical experiments to discover a stable manifold. Another promising direction is the use if a Bayesian surrogate, e.g., a Gaussian processbased model [44], as well as the use of predictive variance to execute the smart exploration of the manufacturing process.
Author Contributions
Conceptualization, V.C., N.N., O.W. and J.Z.; methodology, V.C., N.N., O.W. and J.Z.; software, F.S.; validation, F.S. and O.W.; data curation, F.S. and O.W.; writing—original draft preparation, F.S., V.C., N.N., O.W. and J.Z.; writing—review and editing, V.C., O.W. and J.Z.; funding acquisition, V.C., N.N., O.W. and J.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This work has been supported by the National Science Foundation under grant OAC1910539.
Acknowledgments
The authors would like to acknowledge support provided by the Center for Computational Research at the University at Buffalo.
Conflicts of Interest
The authors declare no conflict of interest.
References
 Schoeneman, F.; Mahapatra, S.; Chandola, V.; Napp, N.; Zola, J. Error metrics for learning reliable manifolds from streaming data. In Proceedings of the SIAM International Conference on Data Mining, Westin Galleria, Houston, TX, USA, 27–29 April 2017; pp. 750–758. [Google Scholar]
 Tenenbaum, J.; de Silva, V.; Langford, J. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000, 290, 2319. [Google Scholar] [CrossRef] [PubMed]
 Lim, I.; de Heras Ciechomski, P.; Sarni, S.; Thalmann, D. Planar arrangement of highdimensional biomedical data sets by isomap coordinates. In Proceedings of the IEEE Symposium on ComputerBased Medical Systems, New York, NY, USA, 26–27 June 2003; pp. 50–55. [Google Scholar]
 Dawson, K.; Rodriguez, R.; Malyj, W. Sample phenotype clusters in highdensity oligonucleotide microarray data sets are revealed using Isomap, a nonlinear algorithm. BMC Bioinform. 2005, 6, 195. [Google Scholar] [CrossRef]
 Zhang, Q.; Souvenir, R.; Pless, R. On manifold structure of cardiac MRI data: Application to segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; pp. 1092–1098. [Google Scholar]
 Rohde, G.; Wang, W.; Peng, T.; Murphy, R. Deformationbased nonlinear dimension reduction: Applications to nuclear morphometry. In Proceedings of the IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Paris, France, 14–17 May 2008; pp. 500–503. [Google Scholar]
 Strange, H.; Zwiggelaar, R. Open Problems in Spectral Dimensionality Reduction; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
 Samudrala, S.; Zola, J.; Aluru, S.; Ganapathysubramanian, B. Parallel framework for dimensionality reduction of largescale datasets. Sci. Program. 2015, 2015, 180214. [Google Scholar] [CrossRef]
 Jenkins, O.; Matarić, M. A spatiotemporal extension to Isomap nonlinear dimension reduction. In Proceedings of the International Conference on Machine Learning, Louisville, KY, USA, 16–18 December 2004; p. 56. [Google Scholar]
 Schoeneman, F.; Chandola, V.; Napp, N.; Wodo, O.; Zola, J. EntropyIsomap: Manifold Learning for Highdimensional Dynamic Processes. In Proceedings of the IEEE International Conference on Big Data, Seattle, WA, USA, 10–13 December 2018. [Google Scholar]
 Cox, T.; Cox, M. Multidimensional Scaling, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2000. [Google Scholar]
 Schölkopf, B.; Smola, A.; Müller, K. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 1998, 10, 1299–1319. [Google Scholar] [CrossRef]
 Shaw, B.; Jebara, T. Minimum Volume Embedding. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, San Juan, Puerto Rico, 21–24 March 2007; Volume 2, pp. 460–467. [Google Scholar]
 Roweis, S.; Saul, L. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 2000, 290, 2323–2326. [Google Scholar] [CrossRef] [PubMed]
 Hong, D.; Yokoya, N.; Zhu, X.X. Learning a robust local manifold representation for hyperspectral dimensionality reduction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 2960–2975. [Google Scholar] [CrossRef]
 Coifman, R.R.; Lafon, S.; Lee, A.B.; Maggioni, M.; Nadler, B.; Warner, F.; Zucker, S.W. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proc. Natl. Acad. Sci. USA 2005, 102, 7426–7431. [Google Scholar] [CrossRef]
 Belkin, M.; Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Proceedings of the Advances in Neural Information Processing, Vancouver, BC, Canada, 9–14 December 2002; pp. 585–591. [Google Scholar]
 Wodo, O.; Ganapathysubramanian, B. Modeling morphology evolution during solventbased fabrication of organic solar cells. Comput. Mater. Sci. 2012, 55, 113–126. [Google Scholar] [CrossRef]
 Benner, P.; Gugercin, S.; Willcox, K. A Survey of ProjectionBased Model Reduction Methods for Parametric Dynamical Systems. SIAM Rev. 2015, 57, 483–531. [Google Scholar] [CrossRef]
 Talmon, R.; Coifman, R.R. Empirical intrinsic geometry for nonlinear modeling and time series filtering. Proc. Natl. Acad. Sci. USA 2013, 110, 12535–12540. [Google Scholar] [CrossRef]
 Talmon, R.; Coifman, R.R. Intrinsic modeling of stochastic dynamical systems using empirical geometry. Appl. Comput. Harmon. Anal. 2015, 39, 138–160. [Google Scholar] [CrossRef]
 Talmon, R.; Mallat, S.; Zaveri, H.; Coifman, R.R. Manifold Learning for Latent Variable Inference in Dynamical Systems. IEEE Trans. Signal Proc. 2015, 63, 3843–3856. [Google Scholar] [CrossRef]
 Duque, A.F.; Wolf, G.; Moon, K.R. Visualizing High Dimensional Dynamical Processes. arXiv 2019, arXiv:1906.10725. [Google Scholar]
 Yair, O.; Talmon, R.; Coifman, R.R.; Kevrekidis, I.G. Reconstruction of normal forms by learning informed observation geometries from data. Proc. Natl. Acad. Sci. USA 2017. [Google Scholar] [CrossRef]
 van der Maaten, L.; Hinton, G. Visualizing data using tSNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
 Crone, B.; Dodabalapur, A.; Lin, Y.Y.; Filas, R.; Bao, Z.; LaDuca, A.; Sarpeshkar, R.; Katz, H.; Li, W. Largescale complementary integrated circuits based on organic transistors. Nature 2000, 403, 521–523. [Google Scholar] [CrossRef] [PubMed]
 Dimitrakopoulos, C.D.; Mascaro, D.J. Organic thinfilm transistors: A review of recent advances. IBM J. Res. Dev. 2001, 45, 11–27. [Google Scholar] [CrossRef]
 Hoppe, H.; Sariciftci, N. Organic solar cells: An overview. J. Mater. Res. 2004, 19, 1924–1945. [Google Scholar] [CrossRef]
 Brabec, C.; Scherf, U.; Dyakonov, V. Organic Photovoltaics: Materials, Device Physics, and Manufacturing Technologies; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
 Tyan, Y.S. Organic lightemittingdiode lighting overview. J. Photonics Energy 2011, 1, 011009. [Google Scholar] [CrossRef]
 Thejo, K.N.; Dhoble, S. Organic light emitting diodes: Energy saving lighting technology—A review. Renew. Sustain. Energy Rev. 2012, 16, 2696–2723. [Google Scholar] [CrossRef]
 Crawford, G. Flexible Flat Panel Displays; John Wiley & Sons: Hoboken, NJ, USA, 2005. [Google Scholar]
 Myny, K.; Steudel, S.; Smout, S.; Vicca, P.; Furthner, F.; van der Putten, B.; Tripathi, A.K.; Gelinck, G.H.; Genoe, J.; Dehaene, W. Organic RFID transponder chip with data rate compatible with electronic product coding. Org. Electron. 2010, 11, 1176–1179. [Google Scholar] [CrossRef]
 Myny, K.; Steudel, S.; Vicca, P.; Smout, S.; Beenhakkers, M.J.; van Aerle, N.; Furthner, F.; van der Putten, B.; Tripathi, A.K.; Gelinck, G.H. Organic RFID tags. In Applications of Organic and Printed Electronics; Springer: Berlin/Heidelberg, Germany, 2013; pp. 133–155. [Google Scholar]
 Stoppa, M.; Chiolerio, A. Wearable electronics and smart textiles: A critical review. Sensors 2014, 14, 11957–11992. [Google Scholar] [CrossRef] [PubMed]
 Someya, T.; Sekitani, T.; Iba, S.; Kato, Y.; Kawaguchi, H.; Sakurai, T. A largearea, flexible pressure sensor matrix with organic fieldeffect transistors for artificial skin applications. Proc. Natl. Acad. Sci. USA 2004, 101, 9966–9970. [Google Scholar] [CrossRef] [PubMed]
 Lochner, C.; Khan, Y.; Pierre, A.; Arias, A. Allorganic optoelectronic sensor for pulse oximetry. Nat. Commun. 2014, 5, 5745. [Google Scholar] [CrossRef] [PubMed]
 Zhu, C.; Ninh, C.; Bettinger, C. Photoreconfigurable polymers for biomedical applications: Chemistry and macromolecular engineering. Biomacromolecules 2014, 15, 3474–3494. [Google Scholar] [CrossRef] [PubMed]
 Negi, V.; Wodo, O.; van Franeker, J.J.; Janssen, R.A.; Bobbert, P.A. Simulating phase separation during spin coating of a polymer–fullerene blend: A joint computational and experimental investigation. ACS Appl. Energy Mater. 2018, 1, 725–735. [Google Scholar] [CrossRef]
 Pfeifer, S.; Wodo, O.; Ganapathysubramanian, B. An optimization approach to identify processing pathways for achieving tailored thin film morphologies. Comput. Mater. Sci. 2018, 143, 486–496. [Google Scholar] [CrossRef]
 Shaheen, S.E.; Brabec, C.J.; Sariciftci, N.S. 2.5% efficient organic plastic solar cells. Appl. Phys. Lett. 2001, 78, 841–843. [Google Scholar] [CrossRef]
 Li, G.; Shrotriya, V.; Huang, J.; Yao, Y.; Moriarty, T.; Emery, K.; Yang, Y. Highefficiency solution processable polymer photovoltaic cells by selforganization of polymer blends. Nat. Mater. 2005, 4, 864–868. [Google Scholar] [CrossRef]
 Wodo, O.; Ganapathysubramanian, B. Computationally efficient solution to the CahnHilliard equation: Adaptive implicit time schemes, mesh sensitivity analysis and the 3D isoperimetric problem. J. Comput. Phys. 2011, 230, 6037–6060. [Google Scholar] [CrossRef]
 Mahapatra, S.; Chandola, V. Learning Manifolds from Nonstationary Streaming Data. arXiv 2018, arXiv:stat.ML/1804.08833. [Google Scholar]
Figure 1.
Highdimensional process data trajectories generated by a material morphology simulation. Each row corresponds to the instances sampled at different time points from a single trajectory. Each trajectory ${\Gamma}_{I}$ corresponds to different variants of the organic thin film fabrication process (described by parameters $\varphi $ and $\chi $): ${\Gamma}_{1}=\Gamma (\varphi =0.6,\chi =2.2),{\Gamma}_{2}=\Gamma (\varphi =0.6,\chi =3.0),{\Gamma}_{3}=\Gamma (\varphi =0.5,\chi =3.0)$. Each image is a highdimensional point capturing material morphology (different colors represent different types of polymer making the material). (Please view in color).
Figure 2.
Isomap and principal component analysis (PCA) run on simulation output produced by data with six trajectories for $\chi =3.0$ and $\varphi \in \{0.50,0.52,0.54,0.56,0.58,0.6\}$. The quality of the Isomap manifold and PCA subspace are assessed using residual variance.
Figure 3.
Lowdimensional ($d=3$) representation of six trajectories with fixed $\chi =3.0$ and variable $\varphi \in \{0.50,0.52,0.54,0.56,0.58,0.6\}$ obtained using (a) PCA, (b) Isomap with $k=8$, (c) Isomap with $k=8$ using only the first 30 time steps of each pathway. (Please view in color).
Figure 4.
Pairwise distances of all points with $\chi =3.0$ and from six trajectories for $\varphi \in \{0.50,0.52,0.54,0.56,0.58,0.6\}$ visualized in two ways. (Please view in color). (a) Distance matrix $\mathbf{D}$ for all data grouped by $\varphi $ and ordered by time step along both axes. Distances in subblocks along the main diagonal denote interpoint distances within a fixed $\varphi $value trajectory. Offdiagonal subblocks highlight distance between points lying on disjoint trajectories. (b) Distance matrix $\mathbf{D}$ with rows grouped by $\varphi $ and ordered by time step. Entries in row i are sorted by increasing distance from ${x}_{i}$ and colored according to their $\varphi $ value. Clusters of similar color nearest the left edge reflect kNN having common $\varphi $value for sufficiently large k.
Figure 5.
Neighborhood entropy of different trajectories as a function of k ($\chi =3.0$ and $\varphi \in \{0.50,0.52,0.54,0.56,0.58,0.6\}$).
Figure 6.
Six trajectories with fixed $\chi =3.0$ and the variable $\varphi \in \{0.50,0.52,0.54,0.56,0.58,0.6\}$ were selected to learn mapping and transform the data into three dimensions using (a) Isomap with $k=8$ and (b) EntropyIsomap with $k=8$, $\widehat{H}=0.3$. (c) Neighborhood crossmixing given by EntropyIsomap with $k=8$, $\widehat{H}=0.3$: for each trajectory $\Gamma $, neighbors of each point belonging to individual trajectories are aggregated and shown in stacked bar graph form. (Please view in color).
Figure 7.
EntropyIsomap with the default $k=8$ and entropy threshold $\widehat{H}=0.3$ run for data with $\chi =3.0$ and the variables $\varphi \in \{0.50,0.52,0.54,0.56,0.58,0.6\}$. (Please view in color). (a) Distribution of selected neighborhood sizes ${k}_{i}$ that did not reach the maximum $M=100$. (b) Entropy of the discovered neighborhood for each time step.
Figure 8.
The manifold of the early stage of the morphology evolution with the first 30 points per trajectory. To better illustrate the discovered ordering by two variables, we color coded the same manifold according to increasing $\varphi $ (top) and $\chi $ (bottom). (Please view in color).
Figure 9.
The manifold of the late stage with the first 80 points per trajectory. The same manifold is color coded according to increasing $\varphi $ (top) and $\chi $ (bottom). (Please view in color).
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).