Fuzzy Information Discrimination Measures and Their Application to Low Dimensional Embedding Construction in the UMAP Algorithm

Dimensionality reduction techniques are often used by researchers in order to make high dimensional data easier to interpret visually, as data visualization is only possible in low dimensional spaces. Recent research in nonlinear dimensionality reduction introduced many effective algorithms, including t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP), dimensionality reduction technique based on triplet constraints (TriMAP), and pairwise controlled manifold approximation (PaCMAP), aimed to preserve both the local and global structure of high dimensional data while reducing the dimensionality. The UMAP algorithm has found its application in bioinformatics, genetics, genomics, and has been widely used to improve the accuracy of other machine learning algorithms. In this research, we compare the performance of different fuzzy information discrimination measures used as loss functions in the UMAP algorithm while constructing low dimensional embeddings. In order to achieve this, we derive the gradients of the considered losses analytically and employ the Adam algorithm during the loss function optimization process. From the conducted experimental studies we conclude that the use of either the logarithmic fuzzy cross entropy loss without reduced repulsion or the symmetric logarithmic fuzzy cross entropy loss with sufficiently large neighbor count leads to better global structure preservation of the original multidimensional data when compared to the loss function used in the original UMAP algorithm implementation.


Introduction
Research in artificial intelligence and machine learning has introduced many algorithms that are now widely used to automate processes that earlier required human intervention. Such algorithms include neural networks [1], extreme learning machines [2], support vector machines [3,4], and other algorithms that researchers and practitioners use to solve classification, regression, and clustering problems. These algorithms often work with objects represented by high dimensional vectors, and both the high dimensional data and the decisions made by a trained machine learning algorithm might be hard or barely possible to interpret.
Dimensionality reduction algorithms address the described problem by making high dimensional data visually interpretable. A typical dimensionality reduction algorithm accepts a dataset with objects represented as high dimensional vectors, and outputs a new dataset containing low dimensional vectors that represent the same objects. Data visualization is only possible in two- or three-dimensional spaces. Hence, if a dimensionality reduction algorithm reduces the number of components in the vectors representing objects from the original dataset to either two or three, the dataset can be easily visualized as a scatter plot.
In this research, we reimplement the UMAP algorithm from scratch without using the sampling-based approach during the loss function optimization process. This allows us to incorporate custom loss functions into the UMAP algorithm, and to investigate the performance of different fuzzy information discrimination measures optimized during low dimensional embedding construction that is performed by the UMAP algorithm. We employ the state-of-the-art Adam algorithm [25] during the optimization process. The Adam algorithm is a first-order optimization method. First-order optimization methods exploit information on values and gradients of an optimized function. Hence, we have to derive the gradients of the considered loss functions analytically. After deriving the gradients of the losses, we compare the visualizations obtained while using different losses with different UMAP hyperparameters.
Based on the findings described in [19], the use of loss functions other than the default sampling-based one [10] could possibly lead to different low dimensional embeddings, that potentially better preserve the original structure of a multidimensional dataset. This might simplify the visual interpretability of the data in different domains [12,14], as well as positively affect the accuracy of clustering algorithms based on the preliminary evaluation of the UMAP algorithm.
The results of the study show that the use of either the original logarithmic fuzzy cross entropy or symmetric fuzzy cross entropy leads to better global structure preservation of the original dataset, in case the nearest neighbor count is sufficiently large.

Fuzzy Weighted Undirected Graph Construction in the UMAP Algorithm
The UMAP algorithm has the potential to better preserve both the local and global structure of high dimensional data while performing nonlinear dimensionality reduction, when compared to algorithms such as PCA, multidimensional scaling (MDS), t-SNE, and LargeVis [10].
Recent findings show that the original UMAP implementation optimizes fuzzy cross entropy with drastically reduced repulsion [22], but not the original fuzzy cross entropy as defined in [20,21]. According to [19], the choice of loss function drastically affects the performance of a nonlinear manifold learning algorithm. The reference implementation of the UMAP algorithm uses a sampling-based approach for the sake of performance [10], and this complicates the extensibility of the UMAP with custom losses. Therefore, we reimplement the UMAP algorithm from scratch with an intention to investigate the performance of the considered nonlinear dimensionality reduction technique with different fuzzy information discrimination measures [21,23,24] used as loss functions while constructing low dimensional embeddings.
In this section, we briefly describe the considered manifold learning algorithm. The UMAP algorithm consists of two phases: a fuzzy weighted undirected graph is constructed during the first phase of the nonlinear dimensionality reduction process, and the loss function is optimized during the second phase.
The UMAP algorithm accepts a dataset $X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$, which contains $n$ objects. Every object $\vec{x}_i \in X$ is represented by an $h$-dimensional vector containing real numbers. In other words, $\forall \vec{x}_i \in X: \vec{x}_i \in \mathbb{R}^h$. First, the algorithm searches for the $k$ nearest neighbors $T_i = \{\vec{t}_{i1}, \ldots, \vec{t}_{il}, \ldots, \vec{t}_{ik}\}$ of every object $\vec{x}_i \in X$, assuming $\forall \vec{t}_{il} \in T_i: \vec{t}_{il} \in X$. The $k$ nearest neighbor search is performed using the approach proposed in [26]. For every neighbor from the $T_i$ set, the scalar distance $d_{il}$ between $\vec{x}_i$ and $\vec{t}_{il} \in T_i$ is computed using a distance metric. The distance metric used for this step is a hyperparameter of the UMAP algorithm. If the Euclidean distance metric is used, $d_{il}$ is computed as follows:

$$d_{il} = \sqrt{\sum_{q=1}^{h} \left(x_{iq} - t_{ilq}\right)^2}, \tag{1}$$

where $i$ is the number of an object from the $X$ set; $l$ is the number of one of the $k$ nearest neighbors of the $i$-th object; $h$ denotes the dimensionality of the vector $\vec{x}_i \in X$ representing the $i$-th object, which is equal to the dimensionality of its $l$-th nearest neighbor $\vec{t}_{il} \in T_i$; $x_{iq}$ and $t_{ilq}$ are the $q$-th components of $\vec{x}_i$ and $\vec{t}_{il}$; $T_i$ is a subset of the original dataset $X$ containing the nearest neighbors of the $i$-th object; and $d_{il} \in \mathbb{R}$ is the scalar distance between the $i$-th object and its $l$-th nearest neighbor from the $T_i$ set.
As a result, for every object $\vec{x}_i \in X$ the dimensionality reduction algorithm determines a set $D_i = \{d_{i1}, \ldots, d_{il}, \ldots, d_{ik}\}$ containing the distances between $\vec{x}_i$ and each of its $k$ nearest neighbors.
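The neighbor search and the distance sets $D_i$ can be illustrated with a minimal sketch. It uses a brute force search instead of the approximate method of [26], and it assumes that an object is not counted as its own neighbor; the function names `euclidean` and `knn_distances` are hypothetical and chosen for illustration only.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors, as in (1)."""
    return math.sqrt(sum((uq - vq) ** 2 for uq, vq in zip(u, v)))

def knn_distances(X, k):
    """Return neighbor indices T and distance sets D for every object in X.

    Brute force O(n^2) search: an illustrative stand-in for the approximate
    search of [26]. Assumption: an object is not its own neighbor.
    """
    T, D = [], []
    for i, x_i in enumerate(X):
        # Distances from x_i to every other object, sorted in ascending order.
        candidates = sorted(
            (euclidean(x_i, x_j), j) for j, x_j in enumerate(X) if j != i
        )
        nearest = candidates[:k]
        T.append([j for _, j in nearest])
        D.append([d for d, _ in nearest])
    return T, D

# Toy dataset with n = 4 objects in R^2.
X = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
T, D = knn_distances(X, k=2)
```

Here `D[i]` plays the role of the set $D_i$, and `T[i]` holds the indices of the objects forming $T_i$.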
After computing the distances to each of the $k$ nearest neighbors of $\vec{x}_i$, a fuzzy simplicial set is constructed, represented as a vector $\vec{\mu}_i \in \mathbb{R}^n$, where $n$ denotes the object count in the original high dimensional dataset. In order to construct the $\vec{\mu}_i$ vector for the $i$-th object, the algorithm searches for $\rho_i \in D_i$ such that $\forall d_{il} \in D_i: \rho_i \le d_{il}$. After that, a binary search is performed in order to find $\sigma_i$ that satisfies the following condition:

$$\sum_{l=1}^{k} \exp\left(-\frac{d_{il} - \rho_i}{\sigma_i}\right) = \log_2 k, \tag{2}$$

where $i$ is the number of an object from the $X$ set; $l$ is the number of one of the nearest neighbors of the $i$-th object; $k$ denotes the nearest neighbor count; $\sigma_i \in \mathbb{R}$ is the target variable; $\rho_i \in D_i$ is the distance between the object $\vec{x}_i$ and its nearest neighbor from the $T_i$ set containing $k$ neighbors; and $d_{il} \in D_i$ denotes the distance between the object $\vec{x}_i$ and its $l$-th neighbor from the $T_i$ set.
After determining $\rho_i$ and finding $\sigma_i$ satisfying (2) for every object $\vec{x}_i$ from the original multidimensional dataset $X$, a sparse vector $\vec{\mu}_i \in \mathbb{R}^n$ is constructed. Every $j$-th scalar component of the $\vec{\mu}_i$ vector is given by:

$$\mu_{ij} = \begin{cases} \exp\left(-\dfrac{d_{ij} - \rho_i}{\sigma_i}\right), & \vec{x}_j \in T_i, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

where $i$ is the number of the object for which the $\vec{\mu}_i$ vector is being constructed; $j$ is the number of a possible neighbor of the $i$-th object from the $X$ set, and also the number of a component of the $\vec{\mu}_i$ vector, $j \in \{1, 2, \ldots, n\}$; $\rho_i$ is the minimum distance from the $D_i$ set; $d_{ij}$ is the distance between $\vec{x}_i$ and $\vec{x}_j$; the dimensionality of the $\vec{\mu}_i$ vector is $n$, where $n$ denotes the object count in the multidimensional dataset $X$; and $\mu_{ij} \in [0, 1]$.
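The binary search for $\sigma_i$ and the computation of the nonzero components of $\vec{\mu}_i$ can be sketched as follows. This is a minimal illustration under the assumption that the condition has the form $\sum_l \exp(-(d_{il} - \rho_i)/\sigma_i) = \log_2 k$ and that $k \ge 3$, so that a finite positive solution exists; `find_sigma` and `memberships` are hypothetical names, and the iteration count is an arbitrary choice.

```python
import math

def find_sigma(dists, n_iter=64):
    """Binary search for sigma_i given the distances to the k nearest neighbors."""
    rho = min(dists)                    # distance to the nearest neighbor
    target = math.log2(len(dists))      # right-hand side of the condition

    def total(sigma):
        return sum(math.exp(-(d - rho) / sigma) for d in dists)

    lo, hi = 0.0, 1.0
    # total() grows with sigma and tends to k > log2(k), so this loop terminates.
    while total(hi) < target:
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if total(mid) > target:
            hi = mid
        else:
            lo = mid
    return rho, 0.5 * (lo + hi)

def memberships(dists, rho, sigma):
    """Fuzzy similarities of an object to each of its k nearest neighbors."""
    return [math.exp(-(d - rho) / sigma) for d in dists]

D_i = [1.0, 2.0, 3.0]                   # toy distance set for one object, k = 3
rho_i, sigma_i = find_sigma(D_i)
mu_i = memberships(D_i, rho_i, sigma_i)
```

The nearest neighbor always receives the similarity 1, and all components that do not correspond to one of the $k$ neighbors stay equal to zero, which is why the resulting vector is sparse.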
As a result, for every object $\vec{x}_i \in X$ a sparse vector $\vec{\mu}_i \in \mathbb{R}^n$ is obtained, which encodes fuzzy similarities between the $i$-th object and every $j$-th object belonging to the original high dimensional dataset $X$. Given that $i \in \{1, 2, \ldots, n\}$, the algorithm constructs a sparse weighted adjacency matrix $M \in \mathbb{R}^{n \times n}$, whose $n$ rows are the $n$ sparse fuzzy vectors $\vec{\mu}_i$. The weighted adjacency matrix $M$ represents a fuzzy weighted directed graph encoding pairwise similarities of objects from $X$; $M$ is not symmetric.
At the next step, the asymmetric matrix $M$ is symmetrized using the probabilistic t-conorm according to the following formula:

$$\mu_{ij} \leftarrow \mu_{ij} + \mu_{ji} - \mu_{ij}\,\mu_{ji}, \tag{4}$$

where $i$ and $j$ are the numbers of rows and columns in the $M$ matrix, respectively, noting that $\mu_{ii}$ and $\mu_{jj}$ are equal to 0. As a result, the adjacency matrix $M$ becomes symmetric.
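The symmetrization step can be illustrated with a short sketch that applies the probabilistic t-conorm $a + b - ab$ to every pair of mutually transposed elements. A dense list-of-lists matrix is used here for readability only (a practical implementation would keep $M$ sparse), and `symmetrize` is a hypothetical name.

```python
def symmetrize(M):
    """Element-wise probabilistic t-conorm of M and its transpose."""
    n = len(M)
    return [
        [M[i][j] + M[j][i] - M[i][j] * M[j][i] for j in range(n)]
        for i in range(n)
    ]

# Toy asymmetric matrix of fuzzy similarities with a zero diagonal.
M = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.25, 0.0],
]
S = symmetrize(M)
```

An edge that is present in only one direction keeps its weight (0.25 in the example), while two opposite edges are merged into a single weight that is not smaller than either of them.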

Loss Function Optimization in the UMAP Algorithm
The initial low dimensional representations of high dimensional objects given by h-dimensional vectors from the X set in the R m space are computed using spectral embedding [8], assuming m ≤ h. After applying spectral embedding to the X set, the matrix Y ∈ R n×m is obtained, where n denotes object count in the original dataset X, and m denotes the dimensionality of the target low dimensional space. After computing the initial locations of objects from X in the R m space, the algorithm starts the loss function optimization process. According to [22], the original UMAP algorithm implementation uses weighted fuzzy cross entropy with reduced repulsion as the loss function: where M ∈ R n×n denotes the symmetric adjacency matrix, containing fuzzy values, encoding pairwise similarities of high dimensional objects from the X set (see Section 2.1); Y ∈ R n×m denotes representations of n objects in the low dimensional space R m ; µ ij ∈ [0, 1] denotes a scalar value representing fuzzy similarity of i-th and j-th high dimensional objects from the original X set; and ν ij ∈ [0, 1] denotes a scalar value representing fuzzy similarity of i-th and j-th objects in low dimensional space R m . 
In order to determine the pairwise similarity $\nu_{ij}$ of the $i$-th and $j$-th objects represented by the $i$-th and $j$-th rows of the matrix $Y \in \mathbb{R}^{n \times m}$ in the low dimensional space $\mathbb{R}^m$, the following formula is used:

$$\nu_{ij} = \left(1 + a\, d_{ij}^{2b}\right)^{-1}, \tag{6}$$

where $d_{ij}$ denotes the scalar distance between the $i$-th and $j$-th objects, $\vec{y}_i$ and $\vec{y}_j$, represented by rows of the $Y$ matrix; $d_{ij}$ can be computed using the Euclidean distance formula (1), with the $\vec{x}_i$ and $\vec{t}_{il}$ vectors in (1) replaced by $\vec{y}_i$ and $\vec{y}_j$, respectively, and $h$ replaced by $m$; $a$ and $b$ are the coefficients chosen by non-linear least squares fitting of (6) against the following curve:

$$\Psi(d_{ij}) = \begin{cases} 1, & d_{ij} \le d_{min}, \\ \exp\left(-(d_{ij} - d_{min})\right), & \text{otherwise}, \end{cases} \tag{7}$$

where $d_{min}$ is a hyperparameter of the UMAP algorithm; the recommended values of $d_{min}$ belong to $(0, 1]$ and affect the density of the clusters formed in the low dimensional space $\mathbb{R}^m$ by the objects contained in the $Y$ matrix during the optimization of the loss function (5).
In the UMAP algorithm, the optimization of the loss (5) is performed using stochastic gradient descent [10]. The locations of objects that are represented by rows in the matrix Y ∈ R n×m are modified on every iteration of the stochastic gradient descent algorithm in order to minimize the loss function.
Stochastic gradient descent is a first-order optimization method that exploits information on the values and gradients of the function being optimized. In order to apply a gradient-based algorithm, the gradients of a loss function have to be determined either analytically or numerically. In this paper, we derive the gradients of all of the considered loss functions analytically, which saves the computational time that would otherwise be required to determine the gradients numerically.
In order to derive the gradients, the loss (5) can be transformed into: The terms that do not depend on the Y matrix in Equation (8) are constant on every iteration of the optimization algorithm. After removing the constant terms and replacing ν ij according to (6), Equation (8) is transformed into the following shape: After splitting the function (9) into attractive component L ∼ a and repulsive component L ∼ b that can be independently differentiated, we get the following equation: The first order partial derivative of L ∼ a (10) with respect to d ij is given by: The first order partial derivative of L ∼ b (10) with respect to d ij is given by: Hence, the first order partial derivative of L ∼ 1 with respect to d ij is given by: During the optimization process of the loss function (5) using the gradient (13) the original UMAP implementation also respects the derivative of the d ij Euclidean distance metric. UMAP uses a sampling-based approach, meaning that on every iteration of the original UMAP algorithm, the attractive force L UMAP attr is applied to every pair of objects from the Y set in case the objects are neighbors, with probability determined by the fuzzy value µ ij ∈ [0, 1] indicating the similarity of the two objects. If the two objects are not nearest neighbors, then they are spread away from each other by applying repulsive force L UMAP rep to the objects. The forces are given by [10]: The signs of the forces in (14) differ from the signs of the terms in (13) due to the fact that during loss function minimization using gradient descent the algorithm is moving towards the negative gradient of the loss function.

Fuzzy Cross Entropy Loss
Besides the weighted fuzzy cross entropy loss with reduced repulsion (5), which is optimized in the original UMAP implementation using gradient descent with a sampling-based approach, other fuzzy information discrimination measures exist [20,21]. In this study, we investigate the applicability of these measures in the UMAP algorithm. One such measure is fuzzy cross entropy [20,21], the simplest measure of information discrimination between two fuzzy sets; this measure was derived from Shannon entropy [23].
Fuzzy cross entropy can be used in UMAP to estimate how similar the high dimensional objects from $X$ and their low dimensional representations given by the rows of $Y$ are. In UMAP, high dimensional objects are first transformed into a weighted adjacency matrix $M \in \mathbb{R}^{n \times n}$; the transformation process is described in Section 2.1. The initial low dimensional representations $Y \in \mathbb{R}^{n \times m}$ of objects from the $X$ set are computed by applying spectral embedding [8] to $X$, assuming $m$ is the dimensionality of the target low dimensional space. Similar to (5), the fuzzy cross entropy used to measure information discrimination between the weighted adjacency matrix $M$ and the low dimensional representations $Y$ is given by the following equation:

$$L_2(M, Y) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ \mu_{ij} \log\frac{\mu_{ij}}{\nu_{ij}} + (1 - \mu_{ij}) \log\frac{1 - \mu_{ij}}{1 - \nu_{ij}} \right], \tag{15}$$

where $M \in \mathbb{R}^{n \times n}$ denotes the symmetric weighted adjacency matrix, in which every $i$-th row represents the $i$-th object from the $X$ set and contains fuzzy values describing how similar the $i$-th object is to every other object from the $X$ set; $n$ denotes the object count in the original dataset $X$; $Y \in \mathbb{R}^{n \times m}$ denotes the low dimensional representation of the $n$ objects from $X$; $m$ denotes the dimensionality of the target low dimensional space; $\mu_{ij} \in M$ denotes the fuzzy value describing the similarity of the $i$-th and $j$-th objects in the high dimensional space; and $\nu_{ij}$ denotes the fuzzy value describing the similarity of the $i$-th and $j$-th objects in the low dimensional space $\mathbb{R}^m$, computed according to (6). Similar to (5) and (8), Equation (15) can be transformed using the properties of logarithmic functions, and the constants that do not depend on $Y$ can be ignored during the optimization process. Similar to (9), replacing $\nu_{ij}$ according to (6) transforms (15) into the following equation:

$$\tilde{L}_2(M, Y) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ \mu_{ij} \log\left(1 + a\, d_{ij}^{2b}\right) - (1 - \mu_{ij}) \log\frac{a\, d_{ij}^{2b}}{1 + a\, d_{ij}^{2b}} \right], \tag{16}$$

where $a$ and $b$ denote the coefficients selected before the optimization process starts by non-linear least squares fitting of (6) against the curve (7), and $d_{ij}$ denotes the distance between the $i$-th and $j$-th objects in the low dimensional space $\mathbb{R}^m$.
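Since the only difference between (9) and (16) is in the repulsive component weight, the first-order partial derivative of (16) with respect to $d_{ij}$ is derived similarly to (13) and, for a single pair of objects, is given by:

$$\frac{\partial \tilde{L}_2}{\partial d_{ij}} = \frac{2ab\, \mu_{ij}\, d_{ij}^{2b-1}}{1 + a\, d_{ij}^{2b}} - \frac{2b\, (1 - \mu_{ij})}{d_{ij} \left(1 + a\, d_{ij}^{2b}\right)}. \tag{17}$$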
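An analytically derived derivative of this kind can be checked against a numerical one. The sketch below assumes the usual fuzzy cross entropy term $-\mu \log \nu - (1 - \mu) \log(1 - \nu)$ (constants that do not depend on $Y$ are dropped) with $\nu = 1 / (1 + a d^{2b})$. The function names and the values of $\mu$, $d$, $a$, and $b$ are hypothetical and serve only as an illustration.

```python
import math

def nu(d, a, b):
    """Low dimensional similarity for a pairwise distance d."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))

def fce_term(mu, d, a, b):
    """One (i, j) term of the fuzzy cross entropy without Y-independent constants."""
    v = nu(d, a, b)
    return -mu * math.log(v) - (1.0 - mu) * math.log(1.0 - v)

def fce_term_derivative(mu, d, a, b):
    """Analytic derivative of fce_term with respect to the distance d."""
    denom = 1.0 + a * d ** (2.0 * b)
    attractive = 2.0 * a * b * mu * d ** (2.0 * b - 1.0) / denom
    repulsive = 2.0 * b * (1.0 - mu) / (d * denom)
    return attractive - repulsive

def central_difference(f, x, h=1e-6):
    """Numerical derivative used only to validate the analytic one."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Illustrative values only; a and b would normally come from the fitting step.
mu, d, a, b = 0.7, 0.8, 1.577, 0.895
analytic = fce_term_derivative(mu, d, a, b)
numeric = central_difference(lambda t: fce_term(mu, t, a, b), d)
```

The attractive part is weighted by $\mu_{ij}$ and the repulsive part by $1 - \mu_{ij}$, so strongly connected pairs are pulled together while weakly connected pairs are pushed apart.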

Symmetric Fuzzy Cross Entropy Loss
Symmetric fuzzy cross entropy [20,21] is a symmetric modification of (15) and can also be used to quantify the similarity of the graph M ∈ R n×n and the matrix Y containing n objects belonging to the low dimensional space R m . Similar to (15), in the considered problem, symmetric fuzzy cross entropy is given by: where M ∈ R n×n denotes the symmetric weighted adjacency matrix, where every i-th row represents the i-th object from the X set and contains fuzzy values µ ij describing how similar the i-th object is to every other j-th object from the X set; n denotes object count in X; Y ∈ R n×m denotes the low dimensional representation of n objects from X; and ν ij denotes a fuzzy value representing i-th and j-th object similarities in R m . After the replacement of ν ij in (18) according to (6), the transformation of (18) using the properties of the logarithmic functions gives the loss the following shape: After excluding terms that do not depend on d ij , Equation (19) transforms into: The obtained function (20) can be then split into three terms L ∼ a , L ∼ b , and L ∼ c : The first order partial derivative of L ∼ a with respect to d ij is given by: The first order partial derivative of L ∼ b with respect to d ij is given by: The L ∼ c term of (21) can be differentiated trivially: The summation of the obtained derivatives (22), (23), and (24), leads to the following form of the derivative of (21) after several polynomial transformations:
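The behavior of a symmetric measure can be explored numerically with a small sketch. It assumes that the symmetric fuzzy cross entropy is the sum of the fuzzy cross entropy of $\mu$ relative to $\nu$ and that of $\nu$ relative to $\mu$, which for one pair simplifies to $(\mu - \nu)(\operatorname{logit}\mu - \operatorname{logit}\nu)$; this form, the function names, and the sample values are assumptions made for illustration and may differ in detail from the expressions used in the derivation above. The term is defined only for $\mu, \nu \in (0, 1)$, so a practical implementation would have to clip both values away from 0 and 1.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def nu(d, a, b):
    """Low dimensional similarity for a pairwise distance d."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))

def sym_fce_term(mu, v):
    """Assumed symmetric fuzzy cross entropy of one pair, mu and v in (0, 1)."""
    return (mu - v) * (logit(mu) - logit(v))

def sym_fce_term_dnu(mu, v):
    """Derivative of sym_fce_term with respect to v."""
    return -(logit(mu) - logit(v)) - (mu - v) / (v * (1.0 - v))

def dnu_dd(d, a, b):
    """Derivative of nu with respect to the distance d."""
    v = nu(d, a, b)
    return -2.0 * a * b * d ** (2.0 * b - 1.0) * v * v

# Illustrative values only.
mu, d, a, b = 0.7, 0.8, 1.577, 0.895
v = nu(d, a, b)
# Chain rule: d(term)/dd = d(term)/dnu * dnu/dd.
analytic = sym_fce_term_dnu(mu, v) * dnu_dd(d, a, b)
h = 1e-6
numeric = (sym_fce_term(mu, nu(d + h, a, b)) - sym_fce_term(mu, nu(d - h, a, b))) / (2.0 * h)
```

Under this assumption the term is zero when $\mu = \nu$ and does not change when its two arguments are swapped, which is the property that distinguishes it from the asymmetric measure.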

Modified Fuzzy Cross Entropy Loss
The modified fuzzy cross entropy measure of information discrimination between two sets was proposed in [23]. Modified fuzzy cross entropy is an asymmetric measure. Similar to the considered losses (5), (15), and (18), the modified fuzzy cross entropy loss applied to low dimensional embedding construction in UMAP is given by: where M ∈ R n×n denotes the symmetric weighted adjacency matrix, every i-th row of M represents the i-th object from the X set and contains fuzzy values µ ij describing how similar the i-th object is to every other j-th object from the X set; n denotes object count in X; Y ∈ R n×m denotes the low dimensional representation of n objects from X; and ν ij denotes a fuzzy value representing i-th and j-th object similarities in R m . The transformation of (26) in a fashion similar to (5), (15), and (18), by using the properties of logarithmic functions and removing the constant terms, leads to the following: After replacing ν ij with (6) and splitting (27) into two terms, (27) transforms into: First-order partial derivative of the first term L ∼ a in (28) with respect to d ij is given by: First-order partial derivative of the second term L ∼ b in (28) with respect to d ij is: Hence, the first order partial derivative of (28) with respect to d ij is given by:

Adam Optimization Algorithm
First-order partial derivatives (13), (17), (25), (31) of the considered loss functions (5), (15), (18), (26) were obtained analytically. Hence, the locations of high dimensional objects from $X$ in the low dimensional target space $\mathbb{R}^m$ can be optimized by applying first-order optimization methods to the discussed fuzzy losses. The Adam [25] optimization algorithm is often used while training neural networks [27,28]. The pseudocode of the gradient-based Adam optimization algorithm is given in Algorithm 1, where $f$ denotes the function being optimized.

Algorithm 1. Adam optimization algorithm [25].
1. set $t = 0$
2. initialize the $c_0$ and $v_0$ tensors filled with zeros
3. set $\varepsilon = 10^{-8}$
4. while the stop condition is not met do:
5. $t = t + 1$
6. $c_t = \beta_1 \times c_{t-1} + (1 - \beta_1) \times \nabla f(s_{t-1})$
7. $v_t = \beta_2 \times v_{t-1} + (1 - \beta_2) \times \left(\nabla f(s_{t-1})\right)^2$
8. $s_t = s_{t-1} - \eta \times \dfrac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \times \dfrac{c_t}{\sqrt{v_t} + \varepsilon}$
9. end loop
10. return $s_t$

The parameters $\beta_1$ and $\beta_2$ of the Adam optimization algorithm are often set to 0.9 and 0.999, respectively; the parameter $\varepsilon$ is used to avoid division by zero; and the step size $\eta$ is set depending on the considered domain. The dimensionality of the $c_t$ and $v_t$ vectors is equal to the dimensionality of the candidate solution $s_0$.
In the low dimensional embedding construction problem in UMAP, the Adam algorithm is applied to one of the considered loss functions. During the optimization process, the algorithm uses the weighted adjacency matrix $M \in \mathbb{R}^{n \times n}$ as the first argument in functions (5), (15), (18), (26), where $n$ is the object count in the original high dimensional dataset $X$. The process of weighted adjacency matrix construction was described in Section 2.1. As the second argument in (5), (15), (18), (26), the algorithm uses the matrix $Y \in \mathbb{R}^{n \times m}$, where $m$ denotes the dimensionality of the target space.
Given that, the matrix $Y$ is used as the candidate solution $s_t$ in Adam on every iteration $t$, and the initial solution $s_0$ is constructed from the original high dimensional set $X$ using spectral embedding [8]. The optimization process is stopped when the specified iteration limit is reached.
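The Adam update can be sketched in a few lines of code. The version below follows the standard formulation of [25] with the bias correction folded into the step size, operates on a flat list of floats, and stops after a fixed number of iterations; the function name `adam`, its signature, and the toy quadratic objective are hypothetical and are not taken from the implementation used in this study.

```python
import math

def adam(grad, s0, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, max_iter=1000):
    """Minimize a function given its gradient `grad`, starting from `s0`."""
    s = list(s0)
    c = [0.0] * len(s)  # first moment estimate (moving average of gradients)
    v = [0.0] * len(s)  # second moment estimate (moving average of squared gradients)
    for t in range(1, max_iter + 1):
        g = grad(s)
        # Step size with the bias correction of both moment estimates.
        step = eta * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
        for q in range(len(s)):
            c[q] = beta1 * c[q] + (1.0 - beta1) * g[q]
            v[q] = beta2 * v[q] + (1.0 - beta2) * g[q] * g[q]
            s[q] -= step * c[q] / (math.sqrt(v[q]) + eps)
    return s

# Toy objective: f(s) = (s[0] - 3)^2 + (s[1] + 1)^2 with its minimum at (3, -1).
def toy_gradient(s):
    return [2.0 * (s[0] - 3.0), 2.0 * (s[1] + 1.0)]

solution = adam(toy_gradient, [0.0, 0.0], eta=0.05, max_iter=1000)
```

For the embedding problem, the candidate solution would be the flattened matrix $Y$, and `grad` would return the gradient of the chosen loss with respect to every coordinate of every row of $Y$.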

Fuzzy Weighted Adjacency Matrix Construction
In order to compare the performance of the considered loss functions in the UMAP algorithm, we used datasets generated by the sklearn library [29]. The generated datasets contained 1500 points belonging to R 2 , separated into several noisy clusters of different shapes and sizes. Applying UMAP to datasets containing objects belonging to R 2 allows one to get more context regarding the mutual displacement of objects in the original dataset, as the objects from R 2 can be visualized as is. This allows one to compare the positions of objects from the original dataset with the positions of objects obtained after applying UMAP transformations using different loss functions. Visualizations of the original locations of the generated points in R 2 are shown in Figure 1.

Figure 1. Locations of 1500 points belonging to the datasets generated by sklearn [29] in $\mathbb{R}^2$: (a) blobs; (b) moons, noise level is set to 0.05; (c) circles, noise level is set to 0.05, inner circle radius is equal to one half of the outer circle radius.
In addition, we considered the dataset [30] containing 1797 images of handwritten digits from zero to nine; the images were represented as matrices of shape $\mathbb{R}^{8 \times 8}$. Every cell in such a matrix is characterized by color, encoded as an integer belonging to the [0, 16] interval. Every image from this dataset can be represented by a vector of shape $\mathbb{R}^{64}$, components of which are integers belonging to [0, 16]. The visualization of handwritten digits from the [30] dataset created with sklearn [29] is shown in Figure 2.
The UMAP algorithm was implemented in the Python programming language using such libraries as numpy [31] and numba [32], as described in Section 2.1. First, the UMAP algorithm searches for the $k$ nearest neighbors of every object in the original high dimensional dataset, and then computes the distances to the $k$ nearest neighbors. The $k$ value is a hyperparameter of the UMAP algorithm. As we show later, choosing bigger $k$ values might improve global structure preservation while reducing the dimensionality. After finding the nearest neighbors and computing the distances to them, the weighted adjacency matrix $M \in \mathbb{R}^{n \times n}$ is built, representing a weighted undirected graph that describes pairwise object similarities in the original dataset $X$, as described in Section 2.1.
For 30 randomly chosen handwritten digits from the dataset [30] with the nearest neighbor count $k$ set to two, the neighborhood graph was built by the UMAP algorithm. The graph was represented by a weighted adjacency matrix $M \in \mathbb{R}^{30 \times 30}$, as shown in Figure 3.
Figure 3. Graph represented by the weighted adjacency matrix $M \in \mathbb{R}^{30 \times 30}$ that was built by the UMAP algorithm for 30 randomly chosen images from the [30] dataset with the neighbor count $k$ set to 2. The visualization was obtained using the graphviz tool [33].
For the datasets that were generated with the sklearn library and contain 1500 points belonging to the $\mathbb{R}^2$ space, UMAP computed the distances to the $k$ nearest neighbors and constructed a weighted adjacency matrix $M \in \mathbb{R}^{1500 \times 1500}$. For the dataset containing 1797 handwritten digits represented by 64-dimensional vectors, UMAP computed the distances to the $k$ nearest neighbors and constructed a weighted adjacency matrix $M \in \mathbb{R}^{1797 \times 1797}$.

Coefficients Fitting
After constructing the fuzzy weighted undirected graph for each of the considered datasets, UMAP searches for the $a$ and $b$ coefficients in function (6). The coefficients are chosen by least squares fitting of (6) against the curve (7). The shape of the curve (7) depends on the parameter $d_{min}$. The plot illustrating how the $d_{min}$ variable affects the shape of the curve (6) is shown in Figure 4.
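The coefficient search can be illustrated with a minimal sketch, assuming that the fitted function has the form $\nu(d) = 1 / (1 + a d^{2b})$ and that the target curve equals 1 for $d \le d_{min}$ and $\exp(-(d - d_{min}))$ otherwise. A coarse grid search stands in here for a proper non-linear least squares solver, and the distance range $(0, 3]$, the grid resolution, and all function names are assumptions made for illustration.

```python
import math

def nu(d, a, b):
    """Fitted similarity function."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))

def psi(d, d_min):
    """Target curve controlled by the d_min hyperparameter."""
    return 1.0 if d <= d_min else math.exp(-(d - d_min))

def squared_error(a, b, d_min, distances):
    return sum((nu(d, a, b) - psi(d, d_min)) ** 2 for d in distances)

def fit_ab(d_min):
    """Coarse grid search for (a, b); a simple stand-in for least squares fitting."""
    distances = [3.0 * q / 100 for q in range(1, 101)]  # sample points in (0, 3]
    candidates = [0.05 * q for q in range(1, 61)]       # 0.05, 0.10, ..., 3.00
    _, a, b = min(
        (squared_error(a, b, d_min, distances), a, b)
        for a in candidates
        for b in candidates
    )
    return a, b

a_small, b_small = fit_ab(0.1)  # small d_min
a_large, b_large = fit_ab(1.0)  # large d_min
```

With a larger $d_{min}$ the target curve stays at 1 over a wider range of distances, so the fit selects a smaller $a$; this is one way to see why different $d_{min}$ values produce different coefficient pairs.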

Weighted Fuzzy Cross Entropy Loss Optimization
Using the weighted adjacency matrices obtained for each of the considered datasets with nearest neighbor count set set to 10, and the and coefficients selected by least squares fitting of (6) against (7) with = {0.1,1}, the weighted fuzzy cross entropy with reduced repulsion (5) was minimized using the Adam gradient-based optimization algorithm. The first-order partial derivative of (5) with respect to pairwise distances is given by (13), so the gradients were computed on every iteration according to: where ⃗ and ⃗ denote the i-th and j-th ℝ representations of objects from the original dataset ; denotes the distance between ⃗ and ⃗ in the ℝ space, computed according to (1) on every iteration of the Adam algorithm; ∈ denotes pairwise similarity of the original i-th and j-th objects from the dataset; and and denote the coefficients chosen by least squares fitting of (6) against (7) with a specified value. The parameters of the Adam optimization algorithm are listed in Table 1. For the dataset containing hand-written digits, each digit was assigned with its own color. The colors and the corresponding digits are listed in Figure 5.   The function (6) maps pairwise distances between two nearest neighbors into fuzzy values measuring the similarity of two objects ν ij . According to Figure 4, different d min parameter values lead to a curve different shape (6), meaning that different a and b coefficients get selected. With small values of d min , clusters in UMAP become denser.
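The coefficient fitting step can be sketched with SciPy's nonlinear least squares routine. The sketch assumes that (6) has the usual UMAP form $1 / (1 + a d^{2b})$ and that (7) equals 1 for $d \le d_{\min}$ and decays as $\exp(-(d - d_{\min}))$ beyond it; neither formula is reproduced in this section, so both are assumptions, as are the function names and the sampling grid on $[0, 3]$.

```python
import numpy as np
from scipy.optimize import curve_fit

def low_dim_similarity(d, a, b):
    # Assumed form of function (6): smooth, differentiable similarity curve.
    return 1.0 / (1.0 + a * d ** (2.0 * b))

def target_curve(d, d_min):
    # Assumed form of curve (7): 1 up to d_min, exponential decay afterwards.
    return np.where(d <= d_min, 1.0, np.exp(-(d - d_min)))

def fit_ab(d_min, d_max=3.0, n_points=300):
    # Least squares fit of (6) against (7) on an evenly spaced distance grid.
    d = np.linspace(0.0, d_max, n_points)
    (a, b), _ = curve_fit(low_dim_similarity, d, target_curve(d, d_min))
    return a, b

for d_min in (0.1, 1.0):
    a, b = fit_ab(d_min)
    print(f"d_min={d_min}: a={a:.3f}, b={b:.3f}")
```

Under these assumptions, a smaller $d_{\min}$ yields a larger $a$ and a smaller $b$, that is, a similarity curve that starts to fall off at shorter distances, which is consistent with the denser clusters reported for small $d_{\min}$.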

Weighted Fuzzy Cross Entropy Loss Optimization
Using the weighted adjacency matrices obtained for each of the considered datasets with the nearest neighbor count $k$ set to 10, and the $a$ and $b$ coefficients selected by least squares fitting of (6) against (7) with $d_{\min} \in \{0.1, 1\}$, the weighted fuzzy cross entropy with reduced repulsion (5) was minimized using the Adam gradient-based optimization algorithm. The first-order partial derivative of (5) with respect to the pairwise distances $d_{ij}$ is given by (13), and the gradients were computed from it on every iteration. In the gradient expression, $\vec{y}_i$ and $\vec{y}_j$ denote the $i$-th and $j$-th $\mathbb{R}^2$ representations of objects from the original dataset $X$; $d_{ij}$ denotes the distance between $\vec{y}_i$ and $\vec{y}_j$ in $\mathbb{R}^2$, computed according to (1) on every iteration of the Adam algorithm; $\mu_{ij} \in M$ denotes the pairwise similarity of the original $i$-th and $j$-th objects from $X$; and $a$ and $b$ denote the coefficients chosen by least squares fitting of (6) against (7) with a specified $d_{\min}$ value.
The parameters of the Adam optimization algorithm are listed in Table 1. For the dataset containing handwritten digits, each digit was assigned its own color. The colors and the corresponding digits are listed in Figure 5.

The obtained visualizations for all of the considered datasets are shown in Figures 6 and 7. Figure 6 contains the visualizations for $d_{\min} = 1$, and Figure 7 contains the visualizations for $d_{\min} = 0.1$. According to Figures 6 and 7, with the weighted fuzzy cross entropy measure with reduced repulsion, the UMAP algorithm successfully separates objects into several clusters. With $d_{\min} = 1$, the clusters in $\mathbb{R}^2$ are less dense than the clusters obtained with $d_{\min} = 0.1$.
According to Figures 6d,h and 7d,h, the loss (5) works best when the nearest neighbor count $k$ is set to a relatively small value. In this case, the first term in (5) is equal to zero for all pairs of objects that are not nearest neighbors, because $\mu_{ij} = 0$ for non-neighbors, as described in Section 2.1. On the one hand, this allows one to separate the objects into denser clusters by applying the attractive force only to the nearest neighbors on every iteration. On the other hand, with relatively small $k$ values, information about the global structure of a high dimensional dataset might be lost.
For example, the handwritten digits two and seven are similar, but their clusters, as shown in Figures 6d and 7d, are separated from each other. The handwritten digits one and zero are less similar; however, with $d_{\min} = 1$ their clusters are rendered relatively close to each other.
When the nearest neighbor count is increased to $k = n - 1$, where $n$ denotes the object count in the considered dataset, in order to preserve more of the global structure of the high dimensional data, the algorithm sometimes struggles with local structure preservation, as shown in Figures 6h and 7h. The first term in (32) is no longer equal to zero for non-neighbors, and the attractive force is applied to every object in the dataset, but with different weighting terms $\mu_{ij}$.

Fuzzy Cross Entropy Loss Optimization
The fuzzy cross entropy loss is given by (15), and the first-order partial derivative of (15) with respect to $d_{ij}$ is given by (17). Using the weighted adjacency matrices obtained for the considered datasets with the nearest neighbor count $k$ set to 10 and to $n - 1$, where $n$ denotes the object count in the original high dimensional dataset, and the $a$ and $b$ values in (6) obtained by nonlinear least squares fitting of (6) against (7) with $d_{\min} \in \{0.1, 1\}$, the fuzzy cross entropy loss was minimized using Adam. The parameters of the Adam algorithm are listed in Table 1. The gradient of the fuzzy cross entropy (15) was computed from its derivative (17) on every iteration of Adam. In the gradient expression, $\vec{y}_i$ and $\vec{y}_j$ denote the $i$-th and $j$-th $\mathbb{R}^2$ representations of objects from the original dataset $X$; $d_{ij}$ denotes the distance between $\vec{y}_i$ and $\vec{y}_j$ in $\mathbb{R}^2$, computed according to (1) on every iteration of the Adam algorithm; $\mu_{ij} \in M$ denotes the pairwise similarity of the original $i$-th and $j$-th objects from $X$; and $a$ and $b$ denote the learned coefficients in (6) for a particular $d_{\min}$ value in (7).
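The optimization loop described above can be sketched in NumPy. The sketch rests on several assumptions that are not confirmed by this section: (15) is taken to be the classical fuzzy cross entropy $-\sum_{i \ne j} [\mu_{ij} \log \nu_{ij} + (1 - \mu_{ij}) \log(1 - \nu_{ij})]$ with terms depending only on $\mu_{ij}$ dropped, (6) is taken to be $\nu_{ij} = 1 / (1 + a d_{ij}^{2b})$, (1) is taken to be the Euclidean distance, and $M$ is symmetric with a zero diagonal. The Adam hyperparameters are common defaults rather than the values from Table 1, and all function names and the toy matrix `M` are hypothetical.

```python
import numpy as np

def pairwise(Y):
    diff = Y[:, None, :] - Y[None, :, :]           # y_i - y_j for every pair
    d = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(d, 1.0)                       # placeholder; the diagonal is masked out
    return diff, d

def fuzzy_ce(Y, M, a, b):
    """Assumed form of (15), summed over ordered pairs i != j."""
    _, d = pairwise(Y)
    nu = 1.0 / (1.0 + a * d ** (2.0 * b))
    off = ~np.eye(len(Y), dtype=bool)
    terms = M * np.log(nu) + (1.0 - M) * np.log(1.0 - nu)
    return float(-terms[off].sum())

def fuzzy_ce_grad(Y, M, a, b):
    diff, d = pairwise(Y)
    q = a * d ** (2.0 * b)
    # Derivative of one pair term with respect to d_ij under the assumed forms.
    dL_dd = 2.0 * b * (M * q - (1.0 - M)) / (d * (1.0 + q))
    np.fill_diagonal(dL_dd, 0.0)
    # Chain rule: dd_ij/dy_i = (y_i - y_j) / d_ij; the factor 2 accounts for
    # the mirrored pair (j, i) in the sum over ordered pairs (M is symmetric).
    return 2.0 * ((dL_dd / d)[:, :, None] * diff).sum(axis=1)

def adam(grad_fn, Y0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=300):
    Y = Y0.copy()
    m = np.zeros_like(Y)
    v = np.zeros_like(Y)
    for t in range(1, n_iter + 1):
        g = grad_fn(Y)
        m = beta1 * m + (1.0 - beta1) * g          # first moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g      # second moment estimate
        m_hat = m / (1.0 - beta1 ** t)             # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        Y -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return Y

# Toy similarity matrix with two groups of four objects (illustrative only).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 4)
M = np.where(labels[:, None] == labels[None, :], 0.9, 0.05)
np.fill_diagonal(M, 0.0)
a, b = 1.577, 0.895                                # illustrative coefficient values
Y0 = rng.normal(size=(8, 2))
Y = adam(lambda Z: fuzzy_ce_grad(Z, M, a, b), Y0)
print(fuzzy_ce(Y0, M, a, b), fuzzy_ce(Y, M, a, b))
```

Because the gradient is evaluated over all pairs on every iteration, the cost per step grows quadratically with the object count, in contrast to the sampling-based scheme of the reference implementation.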
The visualizations of the considered datasets in the target low dimensional space $\mathbb{R}^2$ are shown in Figures 8 and 9. The visualizations with $d_{\min} = 1$ are shown in Figure 8, and the visualizations with $d_{\min} = 0.1$ are shown in Figure 9.
According to Figures 8 and 9, with the nearest neighbor count $k$ set to $n - 1$, where $n$ denotes the object count in the original high dimensional dataset, the loss function (15) successfully separates objects into non-overlapping clusters. The global structure of the high dimensional datasets is preserved better with (15) than with (5). According to the locations of the clusters in Figures 8h and 9h, the handwritten digits three and nine are similar, as are two and seven, and four and six, and their clusters are rendered close to each other. The digits zero and one are less similar, and their clusters are placed far from each other. According to Figures 8 and 9, the loss (15) is best used with $k = n - 1$. With $k = 10$, the algorithm might struggle to preserve global distances.

Symmetric Fuzzy Cross Entropy Loss Optimization
The symmetric fuzzy cross entropy is given by (18), and the first-order partial derivative of (18) with respect to $d_{ij}$ is given by (25). For the symmetric fuzzy cross entropy loss, the nearest neighbor count $k$ was also set to 10 and to $n - 1$, where $n$ denotes the object count.
In the case of the symmetric fuzzy cross entropy, we also expected that setting $k = n - 1$ would help to preserve the global structure of the data. The parameters of Adam were set according to Table 1, and the gradient of (18) was computed from its derivative (25) on every iteration. In the gradient expression, $\vec{y}_i$ and $\vec{y}_j$ denote the $i$-th and $j$-th $\mathbb{R}^2$ representations of objects from the original dataset $X$; $d_{ij}$ denotes the distance between $\vec{y}_i$ and $\vec{y}_j$ in $\mathbb{R}^2$, computed according to (1) on every iteration of the Adam algorithm; $\mu_{ij} \in M$ denotes the pairwise similarity of the original $i$-th and $j$-th objects from $X$; and $a$ and $b$ denote the learned coefficients in (6) for a particular $d_{\min}$ value in (7).
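To make the symmetric measure concrete, the following sketch evaluates it for two arrays of membership values. It assumes that (18) is the sum of the two directed fuzzy cross entropies, as in the classical definition of fuzzy divergence; the exact form of (18) is not reproduced in this section. The clipping constant `eps` and the function names are additions made only for this illustration.

```python
import numpy as np

def fuzzy_ce_terms(mu, nu, eps=1e-12):
    # Directed fuzzy cross entropy terms; clipping keeps the logarithms finite.
    mu = np.clip(mu, eps, 1.0 - eps)
    nu = np.clip(nu, eps, 1.0 - eps)
    return mu * np.log(mu / nu) + (1.0 - mu) * np.log((1.0 - mu) / (1.0 - nu))

def symmetric_fuzzy_ce(mu, nu, eps=1e-12):
    # Sum of both directions, so that swapping the arguments changes nothing.
    return float((fuzzy_ce_terms(mu, nu, eps) + fuzzy_ce_terms(nu, mu, eps)).sum())

mu = np.array([0.9, 0.5, 0.1])     # illustrative high dimensional similarities
nu = np.array([0.6, 0.5, 0.3])     # illustrative low dimensional similarities
print(symmetric_fuzzy_ce(mu, nu), symmetric_fuzzy_ce(nu, mu))
```

Under this assumption the per-pair term simplifies to $(\mu_{ij} - \nu_{ij}) \log \frac{\mu_{ij} (1 - \nu_{ij})}{\nu_{ij} (1 - \mu_{ij})}$, which is zero when the two similarities coincide and positive otherwise.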
The obtained visualizations are shown in Figures 10 and 11. According to the visualizations, the use of (18) with k = (n − 1) also allows one to separate objects into dense clusters. The positions of the clusters shown in Figures 10 and 11

Modified Fuzzy Cross Entropy Loss Optimization
The modified fuzzy cross entropy proposed in [23] is given by (26), and the derivative of (26) is given by (31). Preliminary experiments have shown that with a relatively small nearest neighbor count $k$, the loss struggles to preserve both the local and the global structure of the original dataset. Hence, we set $k$ to $n - 1$, where $n$ denotes the object count in $X$. The parameters of the Adam algorithm were set according to Table 1. The gradients of (26) were computed from the derivative (31) on every iteration. In the gradient expression, $\vec{y}_i$ and $\vec{y}_j$ denote the $i$-th and $j$-th $\mathbb{R}^2$ representations of objects from the original dataset $X$; $d_{ij}$ denotes the distance between $\vec{y}_i$ and $\vec{y}_j$ in $\mathbb{R}^2$, computed according to (1) on every iteration of the Adam algorithm; $\mu_{ij} \in M$ denotes the pairwise similarity of the original $i$-th and $j$-th objects from $X$; and $a$ and $b$ denote the coefficients chosen by least squares fitting of (6) against (7) with a specified $d_{\min}$ value.
The obtained visualizations with $d_{\min} = 0.1$ are shown in Figure 12.

According to Figure 12, the modified fuzzy cross entropy given by (26) is also able to find clusters in the high dimensional space and embed them into $\mathbb{R}^2$. The locations and shapes of the clusters are similar to those obtained with the other losses, as shown in Figures 8-11. However, as Figure 12c,d shows, many objects do not belong to any of the clusters. The loss (26) did not manage to discover the clusters of some handwritten digits, such as five and eight.

Discussion
In this research, we considered different loss functions used during the low dimensional embedding construction process in the UMAP algorithm applied to multidimensional data visualization. To this end, we reimplemented the UMAP algorithm from scratch [10], with the intention of making it possible to incorporate custom losses into the original algorithm. The original implementation of the considered dimensionality reduction technique uses a sampling-based approach inspired by stochastic gradient descent while performing loss function optimization, and this leads to a different weighting of the fuzzy cross entropy terms [22] when compared to the traditional fuzzy cross entropy defined in [20,21]. Based on the findings published in [22], we explicitly defined the fuzzy cross entropy loss with reduced repulsion weight, derived the gradients analytically while ignoring the normalization, and optimized the obtained loss using the first-order gradient-based Adam algorithm, without using the sampling-based approach. The other considered loss functions include the original fuzzy cross entropy without term weighting [20,21], the symmetric fuzzy cross entropy [20,21], and the modified fuzzy cross entropy proposed in [23]. The gradients for all of the considered losses were determined analytically in order to make optimization possible using the first-order Adam algorithm without the need for numerical gradient computation.
During the numerical experiment, we considered both multidimensional and two-dimensional datasets. The mutual arrangement of objects belonging to $\mathbb{R}^2$ can be easily visualized (see Figure 1), and their positions can then be compared with the embeddings obtained after applying UMAP-based transformations. This allows one to visually determine how good a manifold learning algorithm is at preserving the local and global structure of the original dataset when performing dimensionality reduction. The obtained visualizations confirm that the fuzzy cross entropy loss with or without reduced repulsion, as well as the symmetric fuzzy cross entropy loss, is able to discover clusters in the original datasets and map them into the target space, preserving the structure of the original datasets. The visualizations of embeddings obtained by applying UMAP to datasets containing objects belonging to $\mathbb{R}^2$ show that the choice of a loss function greatly affects the result. The fuzzy cross entropy with reduced repulsion that is used in the original UMAP algorithm [22] works best with small nearest neighbor counts $k$, and is very good at preserving local structure (see Figure 6a-c). The other considered losses perform best with sufficiently large $k$ values. For example, when $k$ is set to $n - 1$, where $n$ denotes the object count in the original dataset, the algorithm preserves most of the global structure (see Figures 8 and 10).
The visualizations of the high dimensional handwritten digits show that the weighted fuzzy cross entropy loss with reduced repulsion is able to separate data into non-overlapping clusters only for relatively small neighbor counts $k$. With sufficiently large $k$ values, UMAP struggles to preserve the local structure of the original high dimensional dataset (see Figures 6h and 7h). Losses such as the fuzzy cross entropy and the symmetric fuzzy cross entropy with the nearest neighbor count $k$ set to $n - 1$, where $n$ denotes the object count in the original dataset, successfully preserve both the local and global structure of the original datasets. With $k = n - 1$, the symmetric fuzzy cross entropy loss (18) produces clusters with objects packed more densely, as shown in Figures 10h and 11h, and the fuzzy cross entropy loss (15) distributes objects more uniformly in $\mathbb{R}^2$, while preserving the shape and mutual arrangement of the clusters, as shown in Figures 8h and 9h. With the modified fuzzy cross entropy (26), the algorithm is unable to produce non-overlapping clusters for some types of objects, as shown in Figure 12d.

Conclusions
The obtained results show that the use of the fuzzy cross entropy without reduced repulsive weight, as well as of the symmetric fuzzy cross entropy with a sufficiently large nearest neighbor count $k$, can enhance the global structure preservation of the original dataset. This could be useful for the visual interpretation of high dimensional data in many different domains, such as medical diagnosis [34] or single-cell RNA sequence clustering [35]. Dimensionality reduction algorithms also find application in data preprocessing [36], where they are used to enhance the accuracy of clustering or classification algorithms.
Further research could investigate the performance of other fuzzy cross entropies used as loss functions in the UMAP algorithm, such as the Tsallis divergence [37,38], the fuzzy exponential cross entropy [39], and other divergence measures between two fuzzy sets. Additionally, further work could focus on deriving losses based on the principles highlighted in [19]. The approach to multidimensional data visualization presented in this paper is not sampling-based, so further research could also focus on developing sampling-based iterative schemes for the considered losses, similar to the scheme used in the UMAP reference implementation [10], aimed at improving the speed and reducing the computational complexity of the iterative loss function optimization process.