Meta-Parameter Selection for Embedding Generation of Latency Spaces in Auto Encoder Analytics †

Abstract: Picking an appropriate parameter setting (meta-parameters) for visualization and embedding techniques is a tedious task. However, especially when studying the latent representation generated by an autoencoder for unsupervised data analysis, it is also an indispensable one. Here we present a procedure using a cross-correlative take on the meta-parameters. This ansatz allows us to deduce meaningful meta-parameter limits using OPTICS, DBSCAN, UMAP, t-SNE, and k-MEANS. We can perform first steps of a meaningful visual analysis in the unsupervised case using a vanilla autoencoder on the MNIST and DeepVALVE data sets.


Introduction
High-dimensional data creates the need for simplification, of which low-dimensional embeddings as well as data visualization constitute two closely related methodologies. Their goal is to preserve the main patterns within the data and obtain a less complex data representation, which for two- or three-dimensional embeddings also grants direct visual access to the data. It is well known that finding a low-dimensional data embedding is a meticulous, parameter- and data-dependent task for which optimization may be difficult [1]. However, in our approach, we take into account that even the visualization space for an appropriate embedding is related to a set of visualization parameters, which we call meta-parameters. These are not directly optimized over, but introduce bias in the visualization itself when chosen poorly. One example the reader might know is the fact that DBSCAN suffers from the curse of dimensionality when the minimal number of neighboring points $n_{\text{samples}}$ is chosen poorly [2,3]. For our investigation, we chose the challenging setting of data (namely MNIST [4] and DeepVALVE [5]) compressed within the latent space of an autoencoder.

Why Are Autoencoders Interesting?
The idea of autoencoders has existed for more than 30 years [6], and their applications are presently widespread. They range from generalization to classification tasks, denoising, anomaly detection, recommender systems, clustering and dimensionality reduction, with stunning results [7,9-13]. Within this work, we focus on the latter two use cases, wherein autoencoders perform unsupervised feature extraction and dimensionality reduction [14,15]. Autoencoders consist of an encoder-decoder structure as explained in Figure 1. To achieve the above-mentioned goal, the data is embedded within a latent space via the encoder. Usually, the latent dimension is much smaller than that of the original data set. This kind of setting is also known as a bottleneck architecture. From this embedding, the original data representation is reconstructed by the decoder. The system is trained by minimizing the reconstruction error. Conceptually, autoencoders can be seen as a nonlinear generalization of PCA [16]. Under the manifold hypothesis [17], in some settings, they are supposed to learn the intrinsic low-dimensional data manifold embedded (nonlinearly) into the high-dimensional data observation space. In this vein, they can even be interpreted as a nonlinear embedding approach in their own right. In the context of unlabeled high-dimensional data sets, and especially time series, autoencoders have proven to be powerful tools for unsupervised analysis tasks [15,18]. Yet it has become clear in several applications that the "classical" loss term might not be enough to capture the desired behavior [19]. For this reason, some researchers try to improve the reliability and efficiency of their autoencoder models by introducing additional, task-dependent loss terms (e.g., Ref. [20] introduced a topological loss term to preserve connected components within the data; Ref. [21] introduced a perceptual loss to improve image classification; Ref. [22] introduced a loss term to fix class centroids within a classification task).
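As a compact restatement of the bottleneck training objective described above, the reconstruction error can be written as follows. This is a sketch under assumptions: the symbols $f_\phi$ (encoder), $g_\psi$ (decoder), $D$ (data dimension), $d$ (latent dimension) and the choice of a mean squared error are our notation for illustration and are not taken from the hyperparameter tables of this work.

```latex
\min_{\phi,\psi}\; \frac{1}{N}\sum_{i=1}^{N}
  \bigl\| x_i - g_\psi\bigl(f_\phi(x_i)\bigr) \bigr\|_2^2,
\qquad
f_\phi:\mathbb{R}^{D}\to\mathbb{R}^{d},\quad
g_\psi:\mathbb{R}^{d}\to\mathbb{R}^{D},\quad
d \ll D .
```

The latent space studied in the remainder of the text is then the image $\{f_\phi(x_i)\}$ of the data under the trained encoder.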

Our Approach
In this work, we approach this problem upside down. We develop methods to investigate the autoencoder's capability to conform to the manifold hypothesis in a visual and qualitative way, which fits into the general trend of visualization methods gaining importance in recent years [23-25]. Our goal is to give data scientists a non-mathematical and interpretable tool to monitor and supervise the nonlinear embedding process whose result constitutes the latent space. To do so, we proceed as follows: First, we must formulate our concepts. To make clear what is new in our approach, we must distinguish it from classical parameter and hyperparameter tuning models.

Definition 1 (Parameters). Parameters are the quantities that determine the actual shape of the data manifold.
Intuitively, parameters determine the "physics" of our data under consideration. In the case of our autoencoder, they are given by the trainable weights.
Definition 2 (Hyperparameters). Hyperparameters are the quantities that determine the performance, the setup, and the training of our neural, data driven model in a metrisable way.
A summary of our autoencoder model and the corresponding hyperparameters can be found in Tables A1 and A2 in Appendix A. The decoder is just a mirror of the encoder in our case. (Although weight tying is sometimes implemented as well, we refrain from this technique here.)

Definition 3 (Meta-parameters). Meta-parameters are the quantities that determine the performance of our neural, data driven model in a non-metrisable way.
So, it becomes clear why standard (hyper-)parameter optimization methods cannot be applied to the present purpose: Lacking a metric, there is no quantifiable (stochastic) optimization procedure to find an optimal embedding. For this reason, we took a step back to a qualitative level and performed a cross-correlative study including t-SNE, UMAP, k-MEANS, DBSCAN and OPTICS.

Embedding and Visualization Methods
The use of visualization methods to analyze structures of interest in a higher-dimensional space by visual inspection of a lower-dimensional embedding has become a popular approach in recent years, compare [26-35]. Usually, embedding schemes are classified and distinguished based on their embedding properties, e.g., to discriminate linear and nonlinear embeddings. Thus, for reasons of comparison, our approach comprises a comparative study of different embedding techniques. In the following, a short description of these methods is given. Table A3 in Appendix B states the meta-parameters and their default values.

t-SNE
The t-SNE algorithm assigns mutual "neighborhood" probabilities based on a distance metric (most commonly the Euclidean one) between points, both in the original space and in the embedding, and iteratively minimizes the Kullback-Leibler divergence between the two. The most important hyperparameter is the perplexity, which roughly sets the effective number of neighboring points considered for each point. However, the hyperparameters of the intrinsic optimization algorithm also have a crucial impact on the final 2- or 3-dimensional embedding [36,37].
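For orientation, the quantities named above can be written out in the standard t-SNE formulation. The notation ($p_{j|i}$, $q_{ij}$, $\sigma_i$) is the common textbook one and is an assumption on our side, not notation used elsewhere in this work.

```latex
p_{j|i} = \frac{\exp\!\bigl(-\|x_i - x_j\|^2 / 2\sigma_i^2\bigr)}
               {\sum_{k \neq i} \exp\!\bigl(-\|x_i - x_k\|^2 / 2\sigma_i^2\bigr)},
\qquad
\mathrm{Perp}(P_i) = 2^{H(P_i)},
\quad
H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i},

C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} .
```

The bandwidth $\sigma_i$ of each point is chosen such that $\mathrm{Perp}(P_i)$ matches the user-supplied perplexity, which is why the perplexity acts as a soft neighborhood size.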

UMAP
This algorithm represents an advancement with respect to t-SNE by constructing a "fuzzy simplicial complex" on the data. However, choosing the appropriate radius for the related Čech complex is a meticulous task. Additionally, the choice of the metric and the minimum number of neighboring points determine the resulting 2- or 3-dimensional embedding. Like t-SNE, UMAP's dependence on a metric-based minimum point distance makes it prone to the curse of dimensionality [38].

k-MEANS
k-MEANS minimizes the distance of the data points to a predefined number of cluster centers. This predefinition also constitutes its major drawback, aside from not being able to identify noise and imposing convexity on all cluster shapes [39].

DBSCAN
Unlike k-MEANS, DBSCAN is a density-based method able to identify noise and clusters of all shapes. Its main hyperparameters are $\epsilon$, the critical distance within which points are considered to belong to the same cluster, and $n_{\text{samples}}$, the minimum number of points that shall belong to one cluster. As $\epsilon$ is chosen globally, DBSCAN has difficulties with clustering heterogeneous data [40].

OPTICS
OPTICS has many commonalities with DBSCAN. The most substantial difference to DBSCAN is that $\epsilon$ is chosen from a dendrogrammatic graph called the reachability plot. This is based on one of its two main parameters: the reachability distance. It expresses the smallest distance for an object p with respect to another object o such that p is directly density-reachable from o if o is a core object. Intuitively, a core object is one that has at least $n_{\text{samples}}$ points in its vicinity. The reachability plot depicts the reachability distances for each object in the cluster ordering. Clusters within the data set are regions where the reachability distance between points is small, so they correspond to "valleys" within the reachability plot. The reachability plot is rather insensitive to $\epsilon$ and $n_{\text{samples}}$, but if $\epsilon$ is too small, then too many points will have an undefined reachability distance. In contrast to DBSCAN, OPTICS has difficulties when clustering homogeneous data [41].
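The relation between the reachability plot and a DBSCAN-like cut at a fixed $\epsilon$ can be sketched with scikit-learn. This is a minimal illustration, not the setup used in this work: the three synthetic blobs stand in for latent codes, and the values `min_samples=10`, `max_eps=3.5` and the cut at `eps=1.5` are assumptions chosen for the toy data.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

# Synthetic stand-in for latent codes: three well-separated blobs (assumption).
latent, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                       cluster_std=0.4, random_state=0)

# min_samples plays the role of n_samples; max_eps is the global upper bound on epsilon.
optics = OPTICS(min_samples=10, max_eps=3.5).fit(latent)

# Reachability distances in cluster ordering: plotting this sequence gives the
# reachability plot, in which clusters appear as valleys.
reachability = optics.reachability_[optics.ordering_]

# Cutting the plot at a fixed epsilon yields a DBSCAN-like labeling; -1 marks noise.
labels = cluster_optics_dbscan(reachability=optics.reachability_,
                               core_distances=optics.core_distances_,
                               ordering=optics.ordering_, eps=1.5)
n_clusters = len(set(labels) - {-1})
noise_ratio = float(np.mean(labels == -1))
```

Moving the cut `eps` up or down corresponds to moving the horizontal line in the reachability plot, so several candidate values can be compared without refitting OPTICS.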

Organization and Contribution of the Paper
The main part of our work is given by Section 2, where we elaborate on the nature of our cross-correlative approach before demonstrating how our iterative and interactive cross-study systematically leads to more stable meta-parameter settings on MNIST in Section 2.1. Secondly, we apply our procedure to the DeepVALVE time series data in Section 2.2. In Section 3 we study the visualizations generated by the found meta-parameters. Finally, in Section 4, we conclude on the range of visualization meta-parameters and their connection to unsupervised learning. The contributions of this work are
• autoencoder study on DeepVALVE data set
• cross-correlative study of embedding technologies
• procedure to gain manageable meta-parameter ranges
• visual analysis of autoencoder latent spaces

Cross-Correlative Study on Meta-Parameters
For our comparative meta study of dimension reduction algorithms, we define the meta-parameters $\theta_m$ to be the collection of the $\theta_{m_i}$, $i \in I$, where $I$ is the space of values the individual meta-parameters $\theta_{m_i}$ may take, see Table A3. A meta-parameter set of a concrete visualization might be a $k$-dimensional vector embedded into a $k$-dimensional meta-parameter space. To elucidate this, considering a multi-parameter visualization such as the radial visualization method introduced by [42], one faces a $k$-dimensional (meta-)parameter space with $k = 2n$ parameters, $n$ being the number of data dimensions. Finding a good meta-parameter combination generally amounts to an NP-hard optimization problem in $k$ dimensions (within the single-algorithm regime). Thus, our working hypothesis states that insight can be gained about $\theta_m$ by cross-studying $\theta_m$ from a multi-algorithmic point of view, i.e., by considering the family $\theta_{m_i, A_j}$ with $A_j \in A$, where $A$ denotes the set of algorithms and $\theta_{m_i, A_j}$ denotes the $m_i$-th meta-parameter of algorithm $A_j$. Doing so saves the trouble of solving the ($k$-dimensional) meta-parameter problem for one specific algorithm. Instead, we iteratively and interactively tune the $\theta_{m_i, A_j}$ mutually to approach a valuable embedding and visual representation for the data at hand. Let $R_{\text{method}}$ be the range for the cardinality of cluster centers with respect to one of the methodologies quoted above. Then our evaluation results in a cross-correlative range matrix

$$\left(\delta_{i,j}\right)_{i,j \in A}. \tag{3}$$

Herein, $\delta_{i,j}$ denotes the intersection of the ranges of cluster center cardinalities for two methods $i, j$:

$$\delta_{i,j} = R_i \cap R_j. \tag{4}$$

By definition, the matrix in Equation (3) is symmetric around the diagonal. The goal is now to find the minimum of the $\delta_{i,j}$ to come as close to the true intrinsic dimension of the data manifold as possible.
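The range matrix of pairwise intersections can be illustrated in a few lines of plain Python. This is only a sketch: the helper names `range_intersection` and `range_matrix` are hypothetical, ranges are modeled as closed integer intervals, and the numbers in `ranges` are illustrative placeholders rather than results.

```python
def range_intersection(a, b):
    """Intersect two closed integer ranges (lo, hi); return None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def range_matrix(ranges):
    """Build the symmetric matrix delta[i][j] of pairwise range intersections."""
    names = list(ranges)
    return {i: {j: range_intersection(ranges[i], ranges[j]) for j in names}
            for i in names}


# Hypothetical cluster-count ranges per method (illustrative numbers only).
ranges = {"OPTICS": (6, 13), "DBSCAN": (6, 11), "t-SNE": (6, 11), "UMAP": (6, 7)}
delta = range_matrix(ranges)

# The narrowest non-empty entry is the tightest joint estimate of the cluster count.
widths = {(i, j): d[1] - d[0]
          for i, row in delta.items() for j, d in row.items() if d is not None}
tightest = min(widths.values())
```

Because the intersection is commutative, `delta[i][j] == delta[j][i]` holds for all pairs, which mirrors the symmetry of the matrix noted above.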

MNIST
The MNIST data set is a well-known image data set containing digitized images of around 60,000 handwritten digits from zero to nine. Many studies performed with this data set may be found in the literature [43,44]. Therefore, we omit any additional details of this data set except the fact that it is labeled, i.e., for each picture we know which digit is actually depicted. We start our analysis with the reachability plot for the OPTICS algorithm. For computational reasons we fix $\epsilon$ to 3.5, see Appendix C.1.
As shown on the right-hand side of Figure 2, no meaningful structures can be found for $n_{\text{samples}} < 15$ as all points are qualified as noise, which refines the order of magnitude mentioned in [41] for meaningful $n_{\text{samples}}$. The general features of the reachability plot itself are known to be stable under some (meaningful) variations of the meta-parameters $\epsilon$ and $n_{\text{samples}}$ [41]. Valleys in this plot, as shown on the left-hand side of Figure 2, may be connected to clustered structures in the studied latent space as explained in Section 1.3. Tuning $\epsilon = 1.85$, i.e., the red dashed line in Figure 2, we can identify at least six independent structures at the same resolution scale. We also show other, rather poorly tuned values for $\epsilon$, i.e., $\epsilon \in \{1.50, 1.85, 2.50\}$, indicated by the black dashed lines. To bolster this observation, we study the 2D embeddings as computed by t-SNE and UMAP in Figure 3. By eye we can see that both methods give a different perspective on the structure of the latent space. Using t-SNE alone we might identify between six and eleven structurally independent components. On the other hand, UMAP would provide us with six or maybe seven independent structures. Especially the derived upper bounds are very subjective. How should the gaps actually look to be counted as independent? At this point we see how the cross-correlative nature of our approach adds value. By now we have clearly established a lower limit of six cluster structures using Figure 2 (left) and Figure 3 (left and middle). In addition, we have limited $n_{\text{samples}} > 15$. On the right-hand side of Figure 2, we show the number of identified clusters as well as the noise ratio for OPTICS as a function of $n_{\text{samples}}$ for different values of $\epsilon$. We observe that it actually is the fine-tuned run which yields the best signal-to-noise ratio while simultaneously respecting the derived lower limits on $n_{\text{clusters}}$.
Indicating the boundaries deduced so far by green dashed lines, we can set an upper limit on the number of identified clusters. Again, we have settled for rather conservative boundaries by working with $n_{\text{samples}} > 15$. Using the best signal-to-noise ratio, both from Figures 2 and 3, yields $n_{\text{samples}} = 20$ and thus an upper bound of 13 clusters instead of 18. Using this knowledge, let us study the next embedding tool on our list: DBSCAN. As OPTICS and DBSCAN are closely related, we can use the already identified values of $\epsilon$ and $n_{\text{samples}}$ as starting points. This greatly reduces the meta-parameter space to be explored. Indeed, as we can see in Figure 3, DBSCAN favors slightly higher values of $\epsilon$ and lower values of $n_{\text{samples}}$ than OPTICS. However, as OPTICS requires values for $\epsilon$ and $n_{\text{samples}}$ high enough to not fall into the unstable regime, one should also choose $n_{\text{samples}}$ for DBSCAN not too low. This "unstable" behavior can also be observed in Figure 3 for values of $n_{\text{samples}} < 15$. Hence we transfer the OPTICS limit to DBSCAN here and arrive at a fine-tuned limit of 11 clusters. So, in total we find

$$11 < n_{\text{clusters}} < 18. \tag{5}$$

Again, we emphasize that wherever necessary we use very conservative heuristics. Therefore, the suggested limits in Equation (5) capture the full structure of the latent representation as produced by our autoencoder.
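The kind of sweep behind the right-hand panels discussed above (number of clusters and noise ratio as a function of $n_{\text{samples}}$ at fixed $\epsilon$) can be sketched as follows. This is an assumption-laden toy version: it uses scikit-learn's `DBSCAN` on synthetic blobs, the helper name `sweep_min_samples` is hypothetical, and the grid and `eps=0.8` are chosen for the toy data only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs


def sweep_min_samples(points, eps, min_samples_grid):
    """Return (min_samples, n_clusters, noise_ratio) for each grid value."""
    results = []
    for m in min_samples_grid:
        labels = DBSCAN(eps=eps, min_samples=m).fit_predict(points)
        n_clusters = len(set(labels) - {-1})       # label -1 denotes noise
        noise_ratio = float(np.mean(labels == -1))
        results.append((m, n_clusters, noise_ratio))
    return results


# Synthetic stand-in for latent codes (assumption, not MNIST).
latent, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=1)
table = sweep_min_samples(latent, eps=0.8, min_samples_grid=[5, 10, 15, 20, 25])
```

At fixed `eps`, raising `min_samples` can only remove core points, so the noise ratio never decreases along the grid; the informative part of such a table is where the cluster count stabilizes before noise takes over.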

DeepVALVE
The DeepVALVE data set consists of a series (in total around 25,000) of random opening and closing events of an industrial valve as described in [5]. A part of these events is shown in Figure 4. The allowed labels are: START, LOSE, LINEAR, STUCK, END. Thus, as in the case of MNIST, we have a completely labeled data set where we know the cluster cardinality beforehand, see Appendix D for more examples. As we deal with a time series data set, we must specify the way we feed our data to the neural network. Denoting our time series by $X_{0:T}$, we extract windows at time step $t$ of window size $w$, i.e., $X_{t-w:t}$. A batch is then created by randomly sampling $t$. As in this case our latent space is three-dimensional, we are actually able to plot it. The structure found for $w = 10$ is shown in Figure 5. We observe an ellipsoidal structure which is typical for quasi-periodic structures, as indicated in [45]. This is not surprising regarding the recurring opening and closing events of the valve. Now we want to apply the investigative pipeline we developed in Section 2.1. Hence again we start with the OPTICS reachability plot in Figure 6. We can identify several bigger and smaller structures. The reachability graph yields at least three or even four and more structures. Adding the knowledge of Figure 7 we can estimate the lower limit of identified structures as four. Following Section 2.1 one can estimate $n_{\text{samples}} > 20$ from the signal-to-noise ratio on the right-hand side of Figure 6. Again, we fine-tune using the reachability graph. We identify $\epsilon = 0.02$ using this optical procedure. On the right-hand side of Figure 6 we show runs with different fine-tuned values. Indeed, the visual tuning turns out to be not sensitive enough, and the actual value of $\epsilon$ is rather in the range of 0.01. We use this figure to estimate the upper limit of identified clusters to be 13.
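The windowing scheme described at the beginning of this section, i.e., extracting $X_{t-w:t}$ for randomly sampled $t$, can be sketched with NumPy. The function name `sample_windows`, the batch size and the synthetic sine signal are assumptions for illustration; the actual data loader of this work is not shown in the text.

```python
import numpy as np


def sample_windows(series, w, batch_size, rng):
    """Draw a batch of windows series[t-w:t] with end indices t sampled uniformly."""
    T = len(series)
    ends = rng.integers(w, T + 1, size=batch_size)  # valid end indices: w, ..., T
    return np.stack([series[t - w:t] for t in ends])


rng = np.random.default_rng(0)
# Synthetic quasi-periodic signal standing in for the valve time series (assumption).
series = np.sin(np.linspace(0.0, 20.0 * np.pi, 1000))
batch = sample_windows(series, w=10, batch_size=32, rng=rng)
```

Each row of `batch` is one window of length `w`, so the network input dimension equals the window size for a univariate series.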
In Figure 7 one can observe the fact, emerging within the context of temporal data, that outliers can be detected more easily with UMAP than with t-SNE [46,47]. In addition to that, UMAP also preserves global structures better than t-SNE, although there are more advanced methods such as dynamic t-SNE, which includes a notion of temporal coherence that allows for better cluster separation [48]. Summing up, from the t-SNE plot, keeping in mind that cluster sizes and distances carry no specific meaning, one can identify (conservatively estimated) 7 clusters. However, the UMAP plot in the middle of Figure 7 indicates around 5 clusters. Using the limits deduced so far we study DBSCAN on the right-hand side of Figure 7. As with MNIST we observe that DBSCAN prefers slightly different values for $\epsilon$. So, in total we find the limits given in Equation (6).

Visualization of Clustered Data
In Section 2 we estimated the meta-parameters of our benchmark data set MNIST and our testing case DeepVALVE within Equations (5) and (6), respectively. However, how does this help us to gain a better visual understanding of the data set under investigation? Using our set of meta-parameters, we can now study the t-SNE and UMAP embeddings for our OPTICS, DBSCAN and k-MEANS clustering methods to obtain a first grasp on how well the data are classified and separated within the latent space. From Equation (5) we chose settings as disclosed in Table 1. In Figure 8 we show the clusters found by OPTICS, DBSCAN, and k-MEANS projected onto the t-SNE embedding. We observe that both OPTICS and DBSCAN exhibit oversimplification, as has already been visible in Figure 3. Additional structures are only indicated, as few points have been assigned to them. k-MEANS, however, though able to resolve much more substructure, also tends to split certain structures which the other methods clearly identified as belonging together. The reason is that the predefinition of cluster cardinalities introduces some bias. We observe a similar behavior when using the UMAP embedding in Figure 9 instead. This provides us with the possibility of a direct comparison between t-SNE and UMAP embeddings, which is not possible a priori. Let us now apply the same procedure to our test data set DeepVALVE. Again, using the values from Table 1 we project the found clusters onto the t-SNE and the UMAP embeddings, respectively. In Figures 10 and 11 we can see real structural differences between the DeepVALVE data set and the MNIST data set (Figures 8 and 9). Figure 10 (left and middle) clearly reveals that OPTICS is much more sensitive to heterogeneities within the data. This can be an advantage but also a disadvantage: As DeepVALVE is a huge data set with densely distributed points, density-based clustering methods, and especially OPTICS, find more clusters for smaller training sets. For DeepVALVE, we observed a huge difference between 10,000 and 60,000 points (10,000 depicted in Figure 10). The reason is that larger distributions become "filled in" the more samples are drawn from the true distribution. k-MEANS, on the other hand, constitutes a biased version of clustering, which reveals itself for the MNIST as well as for the DeepVALVE data set within the t-SNE as well as within the UMAP embedding. A comparison of Figures 10 and 11 reveals the main advantage claimed for UMAP in the literature: that it can depict and preserve (global) similarities better [49]. This is even more critical for time series than for image data, as time series segmentation often exhibits fewer labels than classification tasks for image data. Hence the procedural error caused by choosing wrong cluster cardinalities rises significantly. Thus, our pipeline involving the cross-correlative usage of clusterings and embeddings raises awareness of this fact and gives a first hint of the scale at which cluster center cardinalities can be expected.
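The projection step used throughout this section, i.e., clustering in the latent space and then coloring an independently computed 2D embedding by the cluster labels, can be sketched as follows. This is a minimal illustration on synthetic data: the blob data, `n_clusters=3` and `perplexity=30` are assumptions, only t-SNE and k-MEANS from scikit-learn are shown, and the same pattern would apply to the other clustering methods or to a UMAP embedding.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for latent codes with 8 latent dimensions (assumption).
latent, _ = make_blobs(n_samples=150, n_features=8, centers=3, random_state=0)

# Cluster in the latent space itself, not in the 2D embedding.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(latent)

# Compute the 2D embedding independently of the clustering.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latent)

# One (x, y, label) row per point, ready for a scatter plot colored by label.
projected = np.column_stack([embedding, labels])
```

Keeping clustering and embedding separate is what makes the comparison meaningful: disagreement between the label coloring and the visual groups in the embedding points to a poorly chosen meta-parameter in one of the two methods.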

Conclusions
Summing up what we have done and learned so far, we can identify four main benefits of our approach: (i) We developed a pipeline to obtain a visual grasp on the generalization capacity of a vanilla autoencoder. (ii) We use clustering and embedding methods in a cross-correlative way to fine-tune their observational capabilities. (iii) This cross-correlative ansatz allows better capture of the interrelation between the (transformed) data and the visualizations and embeddings. (iv) Doing so, structural differences between data sets become apparent, which allows obtaining a first apprehension of an unknown data set without prior knowledge.

The Generalization Capacity vs. the Manifold Hypothesis
One should keep in mind the reason for investigating the latent space in this detailed fashion: We want to have a grasp on the generalization assumption. This is connected, but not identical, to the manifold hypothesis as presented in the introduction. For both of our data sets we know the cluster center cardinalities beforehand, and hence we can evaluate the individual performance of our clustering algorithms on the latent space. However, if they are not known (which is the typical situation for unsupervised learning tasks), our cross-correlative ansatz can give a first hint.

Meta-Parameter Fine-Tuning
In Equations (5) and (6) we present the results of our (visual) meta-parameter fine-tuning. Especially Figures 2 and 6 reveal how visual investigation improves our results. Although these clustering and embedding methods work well within certain ranges of parameters, as e.g., Ref. [41] points out and investigates in detail for OPTICS, visual methods and their subsequent analysis can really suffer from poorly chosen meta-parameters. So, by working in a cross-correlative way one introduces a level of quantitativeness that one would completely lose when restricting oneself to one method.

Interrelation between Data and Methodology
In Figure 12 the latent space of the DeepVALVE data set is investigated using our three different clustering methods, and one can clearly see that something goes wrong for OPTICS. So why is this the case? DeepVALVE is a dense temporal data set, and one would expect the clusters corresponding to the temporal labels to lie at the "edges" of the quasi-periodic structure depicted in Figure 12. However, unlike DBSCAN, OPTICS does not use a single value but a hierarchical range of scales for the reachability distance. Thus, if we have a really dense data set and comparatively few samples to estimate its distribution, it might identify large parts of the data set as noise. This can happen neither with DBSCAN nor with k-MEANS. Hence, we have another demonstration that visual methods, too, should be taken with a grain of salt, at least in the unsupervised case.

Structural Differences between Data Sets
In Sections 2.1 and 2.2 we studied two structurally different data sets with the same analysis pipeline as developed in Section 2. Although MNIST constitutes a 2D image data set, DeepVALVE consists of temporal measurements of a physically non-trivial process and hence exhibits more structure, as depicted in Figure 4. This is clearly visible from the clustering parameters $\epsilon$ and $n_{\text{samples}}$, indicating that DeepVALVE is a much denser data set than MNIST, as well as from the respective visualizations. This shows itself especially in Figures 8 to 11, as discussed in Section 3.

Future Outlook and Comparison to Other Work
In [46] a deep convolutional autoencoder was used as a dimensionality reduction method for the subsequent 2D visualization using PCA, UMAP and t-SNE. The authors also developed a pipeline for a quantitative investigation; however, in contrast to our work, they did not use the visualization and embedding methods in a cross-correlative way. As our results indicate, e.g., in Figures 2 and 6, this adds value to the inter-correlated usage of density-based clustering methods. For future investigation, we plan to migrate our visual meta-parameter selection pipeline (partly) to the hyperparameter learning level. Especially the qualitative analyses in Figures 2 and 6 would profit from a deeper, quantitative treatment. Furthermore, we would like to investigate the connection between the number of training samples necessary to obtain a "good" estimate of the data distribution and the data density in a more sophisticated manner. Especially temporal data sets are prone to heterogeneities that even have physical meaning rather than just being clustering or embedding artefacts. Having performed this comprehensive study, we are keen to walk one step further on this road.

Data Availability Statement: MNIST is available from http://yann.lecun.com/exdb/mnist/. DeepVALVE is a company-internal IAV data set. It will be published in an anonymised fashion following this publication.

Appendix A. Autoencoder Hyperparameters and Architecture for Reproducibility
In Table A1 our choices for the autoencoder hyperparameters are listed. Please note that if not mentioned otherwise, the default values of PyTorch (Version 1.8.1) are used. Table A2 summarizes the encoder-decoder structure of the autoencoder as well as the final validation loss. Please note that the decoder is a mirror of the encoder. Therefore, we omitted its numbers in Table A2.

Table A3. List of meta-parameters used in this study.

In Figure A1 we show additional plots using different values for $n_{\text{samples}}$ and $\epsilon$. As stated in [41], the key features of this plot are rather stable against different choices of the meta-parameters.

Appendix C.2. Reconstructed Digits
For MNIST we can qualitatively check the identified structures. For all three clustering approaches we construct a cluster center. For k-MEANS this is done automatically by the algorithm. On the other hand, for OPTICS and DBSCAN we just use the center of mass of all points belonging to a given cluster. We then reconstruct the images by sending these points through the decoder.
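The center-of-mass construction for the density-based methods can be sketched with NumPy. The helper name `cluster_centroids` and the toy arrays are hypothetical; treating the label -1 as noise follows the scikit-learn convention and is an assumption about the implementation used here. The subsequent decoding step would require the trained decoder and is therefore not shown.

```python
import numpy as np


def cluster_centroids(latent, labels):
    """Mean latent vector per cluster; points labeled -1 (noise) are skipped."""
    return {int(k): latent[labels == k].mean(axis=0)
            for k in np.unique(labels) if k != -1}


# Toy latent codes and labels (illustrative values only).
latent = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [5.0, 5.0]])
labels = np.array([0, 0, 1, -1])
centroids = cluster_centroids(latent, labels)
```

For k-MEANS no such helper is needed, since the algorithm already returns its cluster centers.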
In Figure A2 we present the reconstructions corresponding to the right-hand side of Figure 8 and Figure 9 in the main text, respectively. We observe that indeed most of the digits could be identified. However, digit 4 is missing, while digits 1 and 9 are doubled, a behavior we already observed in Section 2.1.

Figure A2. Reconstructed images of the centroids of the clusters using k-MEANS clustering with $n_{\text{clusters}} = 11$.
Once we increase the allowed number of clusters to $n_{\text{clusters}} = 18$, as shown in Figure A3, we observe that now all digits are present. However, we also have quite some doubling in digits 0 to 4.

Figure A3. Reconstructed images of the centroids of the clusters using k-MEANS clustering with $n_{\text{clusters}} = 18$.
As displayed in Figure A4, a similar behavior emerges when we use DBSCAN instead. Using the values from Table 1 we recover most digits except 8 and 9. Again, for the other digits there are several clusters they belong to. Finally, in Figure A5 we show the reconstructed digits for OPTICS. Again, we observe missing digits, 3 and 5 this time, as well as two versions of 4. Interestingly, k-MEANS has trouble locating different digits than OPTICS and DBSCAN do. The latter two again behave rather similarly.