Machine Learning and Knowledge Extraction
  • Article
  • Open Access

18 February 2026

Novel Loss Functions for Improved Data Visualization in t-SNE

1 College of Science and Engineering, Hamad Bin Khalifa University, Education City, Gate 8, Doha P.O. Box 5825, Qatar
2 Department of Computer Science, Bishop’s University, 2600 College Street, Sherbrooke, QC J1M 1Z7, Canada
* Author to whom correspondence should be addressed.
This article belongs to the Section Visualization

Abstract

A popular method for projecting high-dimensional data onto a lower-dimensional space while preserving the integrity of its structure is t-distributed Stochastic Neighbor Embedding (t-SNE). This technique minimizes the Kullback–Leibler (KL) divergence to align the similarities between points in the original and reduced spaces. While t-SNE is highly effective, it prioritizes local neighborhood preservation, which results in limited separation between distant clusters and inadequate representation of global relationships. To address these limitations, this work introduces two complementary approaches: (1) the Max-Flipped KL Divergence (KL_max) modifies the original divergence by incorporating a contrastive term, KL′, which enhances the ranking of point similarities through maximum-similarity constraints; (2) the KL–Wasserstein Loss (L_KLW) combines the KL divergence with the classic Wasserstein distance, allowing the embedding to benefit from the smooth, geometry-aware transport properties of Wasserstein metrics. Experimental results show that these methods lead to improved separation and better structural clarity in the low-dimensional space compared to standard t-SNE.

1. Introduction

A core challenge in exploratory data analysis is producing two-dimensional views that are both locally faithful and globally interpretable. Popular methods include Principal Component Analysis (PCA) [1], Uniform Manifold Approximation and Projection (UMAP) [2], and t-distributed Stochastic Neighbor Embedding (t-SNE) [3]. Among these, t-SNE is widely adopted for preserving local neighborhoods, which makes clusters visually salient. However, optimizing the Kullback–Leibler (KL) divergence in the standard objective has known side effects: it tends to under-represent global relationships (inter-cluster layout, long-range structure), is sensitive to a few large P_ij values, and its asymmetry makes it more punitive for missing close neighbors than for misplacing distant ones. These behaviors can limit interpretability when cluster arrangement and global trends matter.
Understanding the tension between local and global fidelity is critical in many practical applications. For example, in single-cell biology, similar cell types may form distinct clusters that should still reflect developmental trajectories; meanwhile, in image retrieval and language modeling, globally meaningful structure supports semantic grouping and hierarchy discovery. Existing methods have sought to balance this trade-off: f-divergence variants of t-SNE re-weight neighborhood probabilities to adjust local sensitivity, while global structure-aware algorithms such as UMAP introduce additional terms that encourage distant clusters to retain relative positioning. Despite these efforts, there remains no simple formulation that simultaneously enforces local ranking consistency and global geometric alignment within the original t-SNE optimization framework.
To address this gap, this work introduces two enhanced formulations that extend the t-SNE framework: (1) the Max-Flipped KL Divergence (KL_max) and (2) the KL–Wasserstein Loss (L_KLW). Each method is designed to mitigate specific shortcomings of the original t-SNE formulation while improving the separation and interpretability of resulting visualizations.
First, the Max-Flipped KL Divergence (KL_max) augments the traditional KL divergence with an additional term, KL′, that focuses on discrepancies in similarity rankings. By explicitly penalizing violations of expected similarity orderings, this extension improves the model’s sensitivity to subtle structural variations and enhances separation between distinct groups.
Second, the KL–Wasserstein Loss (L_KLW) introduces a hybrid objective that combines the KL divergence with the classic Wasserstein distance. This integration leverages the Wasserstein metric’s ability to capture smooth, geometry-aware transport between distributions, thus reinforcing the separation between clusters while preserving meaningful local arrangements.
In contrast to prior t-SNE extensions based on alternative f-divergences such as Jensen–Shannon, reverse KL, and other symmetric formulations [4,5,6,7,8], as well as OT-regularized or transport-based embedding methods [9,10,11,12,13,14], the proposed KL_max and L_KLW differ in two fundamental aspects. KL_max introduces a similarity-ranking inversion mechanism that directly penalizes deviations from the local maximum probability, a behavior not captured by symmetric or alternative f-divergence objectives [7,8]. L_KLW incorporates a direct Wasserstein W_1 alignment term in the embedding space, unlike sliced-OT or entropy-regularized OT variants that operate in the data space or rely on smoothed surrogate distances [9,11,12]. This distinction allows our formulation to jointly address local ranking fidelity and global geometric consistency.
Recent work in representation learning has shown that preserving semantic relationships in complex data is essential for obtaining globally coherent embeddings. For example, robust feature representation methods for human parsing in complex scenes [15] emphasize maintaining consistent global structure, which is conceptually aligned with our objective of enhancing global structure consistency through the Wasserstein term. In addition, robustness to geometric transformations has been studied in other domains, such as light-field image watermarking. Techniques designed to withstand multidimensional geometric attacks [16] emphasize the importance of preserving structural consistency under complex transformations. Although the application domain differs, this broader theme is conceptually aligned with our motivation for incorporating the Wasserstein distance to strengthen the geometric structure preservation capability of the embedding.
Together, these considerations motivate the need for embedding objectives that more faithfully preserve both local and global relationships, a goal that the proposed KL_max and L_KLW formulations are specifically designed to address.

2. t-Distributed Stochastic Neighbour Embedding (t-SNE)

t-SNE is a widely adopted method for visualizing high-dimensional data. It relies on the KL divergence to align probability distributions in the original high-dimensional space with those in the low-dimensional embedding. The primary goal of t-SNE is to minimize the divergence between two distributions:
(1) The high-dimensional probability distribution P_ij, which models the similarity between points x_i and x_j using a Gaussian kernel:

P_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}    (1)

where σ_i is a local bandwidth parameter. The final symmetric joint probability is given by:

P_{ij} = \frac{P_{j|i} + P_{i|j}}{2n}    (2)

(2) The low-dimensional probability distribution Q_ij, modeled with a Student’s t-distribution to account for distant similarities:

Q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}    (3)

The optimization objective in t-SNE is to minimize the KL divergence between P_ij and Q_ij:

\mathrm{KL}(P \| Q) = \sum_i \sum_{j \neq i} P_{ij} \log \frac{P_{ij}}{Q_{ij}}    (4)

This ensures that the low-dimensional representation preserves the pairwise similarities of the original space as accurately as possible. The gradient of the KL divergence with respect to the low-dimensional embeddings is:

\nabla_{y_i} \mathrm{KL}(P \| Q) = 4 \sum_{j \neq i} (P_{ij} - Q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}    (5)

This gradient is used to iteratively update the embeddings via gradient descent, ensuring the local structure is well preserved.
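The quantities above can be sketched in a few lines of NumPy. This is a minimal illustration, not the full algorithm: it assumes a single global bandwidth σ rather than the per-point perplexity calibration used in practice, and the function names are ours.

```python
import numpy as np

def tsne_distributions(X, Y, sigma=1.0):
    """Compute the Gaussian joint P (high-dim) and Student-t Q (low-dim)."""
    n = X.shape[0]
    # Pairwise squared distances in the high-dimensional space
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -D / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # exclude self-similarity
    P_cond = np.exp(logits)
    P_cond /= P_cond.sum(axis=1, keepdims=True)   # row-wise P_{j|i}
    P = (P_cond + P_cond.T) / (2.0 * n)           # symmetric joint P_ij

    # Student-t similarities in the embedding, normalized over all pairs
    Dy = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + Dy)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    return P, Q

def kl_gradient(P, Q, Y):
    """Analytic gradient of KL(P||Q) w.r.t. the embedding Y (Equation (5))."""
    Dy = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + Dy)
    np.fill_diagonal(inv, 0.0)
    # 4 * sum_j (P_ij - Q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}
    return 4.0 * np.einsum('ij,ijk->ik', (P - Q) * inv,
                           Y[:, None, :] - Y[None, :, :])
```

Gradient descent then repeatedly evaluates `kl_gradient` and steps the embedding, typically with momentum and early exaggeration in production implementations.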
Although t-SNE is highly effective for clustering and visualizing high-dimensional data, its dependence on the KL divergence introduces several well-documented drawbacks. First, because the KL objective primarily focuses on minimizing local reconstruction errors, it tends to emphasize neighborhood-level relationships while neglecting large-scale structures such as hierarchies, inter-cluster layouts, and global trends [17,18,19]. As a result, clusters that are meaningfully related in the original space may appear arbitrarily separated or distorted in two-dimensional projections, reducing interpretability in tasks that depend on global context. Second, the asymmetric nature of the KL divergence makes it more sensitive to underestimations in Q_ij than to overestimations, causing t-SNE to penalize missing close neighbors far more severely than misplacing distant ones. This imbalance strengthens local compactness but further weakens the preservation of global geometry. Addressing these limitations motivates our proposed extensions, which explicitly target both local ranking consistency and global geometric alignment.

3. Wasserstein Distance

The p-Wasserstein distance (p ∈ [1, ∞)) quantifies the minimum “effort” required to transport probability mass from one distribution to another over a metric space (Ω, d). It unifies continuous (density-based) and discrete (empirical) settings under the same optimal transport (OT) framework [20].
Given probability measures P and Q on (Ω, d), the p-Wasserstein distance is

W_p(P, Q) = \left( \inf_{\gamma \in \Gamma(P, Q)} \int_{\Omega \times \Omega} d(x, y)^p \, \mathrm{d}\gamma(x, y) \right)^{1/p}    (6)

where Γ(P, Q) is the set of couplings (transport plans) with the correct marginals:

\gamma(A \times \Omega) = P(A), \qquad \gamma(\Omega \times B) = Q(B) \quad \text{for all Borel } A, B \subseteq \Omega    (7)
When P is absolutely continuous with respect to the Lebesgue measure, the Kantorovich problem admits a map solution in many cases. The Monge formulation searches for a measurable transport map f : Ω → Ω pushing P to Q (denoted f_{\#}P = Q; i.e., Q(A) = P(f^{-1}(A)) for all Borel A):

W_p(P, Q) = \left( \inf_{f : f_{\#}P = Q} \int_{\Omega} d(x, f(x))^p \, \mathrm{d}P(x) \right)^{1/p}    (8)

For p = 2 on Ω ⊆ R^d, [21] shows that the optimal map exists and is the gradient of a convex potential (under mild conditions), linking (8) and (6).
In one dimension, the optimal map is monotone. Let F_P and F_Q be the CDFs of P and Q, and F_P^{-1}, F_Q^{-1} their quantile functions. Then

W_p(P, Q) = \left( \int_0^1 \left| F_P^{-1}(u) - F_Q^{-1}(u) \right|^p \, \mathrm{d}u \right)^{1/p}    (9)

Equivalently, f(x) = F_Q^{-1}(F_P(x)) yields the Monge map and recovers (9) via (8).
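For equal-size empirical samples with uniform weights, the quantile formula reduces to matching sorted values. The following sketch (illustrative function names, standard NumPy only) computes the 1-D distance and an empirical Monge map:

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two equal-size 1-D empirical samples.

    With uniform weights, the quantile formula reduces to matching sorted
    samples: W_p^p = (1/n) * sum_i |x_(i) - y_(i)|^p.
    """
    xs, ys = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)

def monge_map_1d(x_query, x_sample, y_sample):
    """Empirical Monge map f = F_Q^{-1} o F_P evaluated at x_query."""
    xs, ys = np.sort(x_sample), np.sort(y_sample)
    # F_P(x): empirical CDF rank; F_Q^{-1}: empirical quantile
    u = np.searchsorted(xs, x_query, side='right') / len(xs)
    return np.quantile(ys, np.clip(u, 0.0, 1.0))
```

For a pure translation y = x + c, the sorted matching pairs shifted copies, so the 1-Wasserstein distance is exactly |c| and the Monge map is approximately x ↦ x + c.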
For discrete/empirical measures, let P = \sum_{i=1}^{n} a_i \delta_{x_i} and Q = \sum_{j=1}^{m} b_j \delta_{y_j} with a_i, b_j ≥ 0, \sum_i a_i = \sum_j b_j = 1. Writing C_{ij} = d(x_i, y_j)^p, the OT problem becomes a finite linear program:

W_p(P, Q)^p = \min_{\Pi \in \mathbb{R}_+^{n \times m}} \sum_{i=1}^{n} \sum_{j=1}^{m} \Pi_{ij} C_{ij} \quad \text{s.t.} \quad \sum_j \Pi_{ij} = a_i, \; \sum_i \Pi_{ij} = b_j    (10)

This discrete formulation is the Kantorovich linear program [22]. When n = m and a_i = b_j = 1/n, (10) searches over doubly-stochastic Π. In one dimension with uniform weights, sorting the supports gives a closed form:

W_p(P, Q)^p = \frac{1}{n} \sum_{i=1}^{n} \left| x_{(i)} - y_{(i)} \right|^p    (11)

where x_{(1)} ≤ ⋯ ≤ x_{(n)} and y_{(1)} ≤ ⋯ ≤ y_{(n)}.
For general (n, m), (10) has nm variables and is typically solved by specialized LP/OT solvers. Exact network-simplex methods scale superlinearly (often near-cubic in n in practice), while entropic regularization leads to fast Sinkhorn matrix scaling with O(nm) work per iteration. In one dimension, computing W_p reduces to sorting (cost O(n log n)) and then linear time; if samples are pre-sorted, the evaluation is O(n).
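The entropic alternative mentioned above can be sketched in a few lines. This is a minimal Sinkhorn iteration (illustrative names, no convergence checks); it assumes the Gibbs kernel entries stay strictly positive in floating point, and an exact LP solver (e.g., `ot.emd` from the POT library) would recover the unregularized optimum:

```python
import numpy as np

def sinkhorn_plan(a, b, C, reg=0.05, n_iters=500):
    """Entropy-regularized OT via Sinkhorn matrix scaling on cost matrix C.

    Returns a transport plan Pi with (approximately) marginals a and b;
    <Pi, C> approximates W_p^p from the Kantorovich LP as reg -> 0.
    """
    K = np.exp(-C / reg)            # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)           # scale to match column marginals
        u = a / (K @ v)             # scale to match row marginals
    return u[:, None] * K * v[None, :]
```

Each iteration costs O(nm), in line with the complexity discussion above; the regularization strength `reg` trades accuracy for numerical stability.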
Equations (6)–(10) provide a unified framework for both continuous measures (via maps or plans) and discrete empirical distributions (via finite couplings), while the one-dimensional case (9)–(11) admits closed-form solutions [23].

4. Proposed Methods

This section introduces two methods that directly target the well-known limitations of t-SNE. Standard t-SNE’s asymmetric KL(P‖Q) overemphasizes missed neighbors, tolerates false ones, and thus causes crowding and unreliable global geometry. We address this with KL_max, which enforces bidirectional fidelity and reduces false neighbors, and with L_KLW = KL + βW, which preserves global structure via a transport-aware term. Together, these objectives lower MSE, stabilize embeddings, and yield clearer, more interpretable separations with trustworthy inter-cluster distances for downstream analysis.

4.1. Max-Flipped KL Divergence

The proposed KL_max builds on the original loss by introducing an additional term, KL′, which focuses on differences in similarity rankings. By capturing deviations from the highest similarity values in both the high-dimensional and low-dimensional spaces, KL′ ensures that not just the raw probabilities, but also their relative importance, are preserved. This enhancement enables KL_max to provide a more balanced embedding, effectively representing relationships between pairs of neighbors rather than focusing on a single dominant relation.
The KL_max divergence is defined as

\mathrm{KL}_{\max}(P \| Q) = \mathrm{KL}(P \| Q) + \alpha \, \mathrm{KL}'(P' \| Q')    (12)

where KL(P‖Q) is provided in Equation (4), and

\mathrm{KL}'(P' \| Q') = \sum_i \sum_{j \neq i} P'_{ij} \log \frac{P'_{ij}}{Q'_{ij}}    (13)

Here,

P'_{ij} = \max(P_{i\cdot}) - P_{ij}    (14)

Q'_{ij} = \max(Q_{i\cdot}) - Q_{ij}    (15)

and both P'_{ij} and Q'_{ij} are subsequently normalized so that \sum_{j \neq i} P'_{ij} = 1 and \sum_{j \neq i} Q'_{ij} = 1 for each i.
This transformation effectively “flips” the focus of the divergence from comparing probabilities directly to measuring their deviations from the maximum probabilities within their respective distributions. Instead of emphasizing the values of P_ij or Q_ij themselves, KL_max prioritizes the distances of P_ij and Q_ij from their most probable counterparts, max(P_i·) and max(Q_i·).
The gradient of KL_max with respect to the low-dimensional embeddings y_i is the sum of the gradients of the two components:

\nabla_{y_i} \mathrm{KL}_{\max}(P \| Q) = \nabla_{y_i} \mathrm{KL}(P \| Q) + \alpha \nabla_{y_i} \mathrm{KL}'(P' \| Q')    (16)

The gradient of KL(P‖Q) is given in Equation (5). Using the Student-t kernel q_{ij} ∝ (1 + \|y_i - y_j\|^2)^{-1} and the standard t-SNE force decomposition, the contribution of KL′(P′‖Q′) takes the same force form with (P′, Q′):

\nabla_{y_i} \mathrm{KL}'(P' \| Q') = 4 \sum_{j \neq i} (P'_{ij} - Q'_{ij}) \frac{y_i - y_j}{1 + \|y_i - y_j\|^2}    (17)

Combining (16) with (17) and the standard KL term yields the final expression:

\nabla_{y_i} \mathrm{KL}_{\max}(P \| Q) = 4 \sum_{j \neq i} \left[ (P_{ij} - Q_{ij}) + \alpha (P'_{ij} - Q'_{ij}) \right] \frac{y_i - y_j}{1 + \|y_i - y_j\|^2}    (18)
To avoid undefined terms when Q′_ij = 0 with P′_ij > 0, we (i) apply the same ε-floor to both P′ and Q′, and (ii) compute KL′ on the union of active indices per row (equivalently, after flooring and renormalization all entries are strictly positive). With these two steps, KL′ is well-defined.
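The flipped construction and the resulting loss can be sketched as follows (a minimal NumPy version; `flip_rows` and `kl_max_loss` are illustrative names, and the ε-floor follows the description above):

```python
import numpy as np

def flip_rows(M, eps=1e-12):
    """Flip each row around its maximum, then floor and row-normalize."""
    F = M.max(axis=1, keepdims=True) - M
    np.fill_diagonal(F, 0.0)            # self-pairs stay excluded
    F = np.maximum(F, eps)              # epsilon-floor keeps logs well-defined
    np.fill_diagonal(F, 0.0)
    return F / F.sum(axis=1, keepdims=True)

def kl_max_loss(P, Q, alpha=0.5, eps=1e-12):
    """KL_max = KL(P||Q) + alpha * KL'(P'||Q'), summed over off-diagonal pairs."""
    mask = ~np.eye(P.shape[0], dtype=bool)
    def kl(A, B):
        A = np.maximum(A, eps)
        B = np.maximum(B, eps)
        return float(np.sum(A[mask] * np.log(A[mask] / B[mask])))
    return kl(P, Q) + alpha * kl(flip_rows(P), flip_rows(Q))
```

When P and Q coincide, both terms vanish; otherwise the α-weighted flipped term adds a penalty whenever the distance-to-maximum rankings disagree.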

4.2. KL–Wasserstein Loss

The proposed KL–Wasserstein objective augments the standard t-SNE loss with a transport-based regularization term. Formally, the combined loss is

\mathcal{L}_{\mathrm{KLW}} = \mathrm{KL}(P \| Q) + \beta \, W_p(P, Q)    (19)

where KL(P‖Q) is the conventional similarity-based divergence (Equation (4)). The Wasserstein component W_p(P, Q) captures global geometric discrepancies between the original data X = \{x_i\}_{i=1}^{n} \subset \mathbb{R}^D and the embedded points Y = \{y_i\}_{i=1}^{n} \subset \mathbb{R}^2. In our formulation, we evaluate the first Wasserstein distance W_1 between the empirical measures \mu_X = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i} and \mu_Y = \frac{1}{n} \sum_{i=1}^{n} \delta_{y_i}, computed using the wasserstein_distance_nd routine from scipy.stats. This computation is applied on mini-batches and incurs only moderate cost.
Because wasserstein_distance_nd provides only the scalar distance, we estimate its gradient with respect to the embedding Y using a stochastic finite-difference approximation inspired by SPSA. Small random perturbations are applied to Y, and the resulting changes in W 1 are used to approximate W 1 / Y .
For reference, the analytic subgradient associated with an explicit optimal assignment σ is

\nabla_{y_i} W_1(\mu_X, \mu_Y) = \frac{1}{n} \, \frac{y_i - x_{\sigma(i)}}{\|y_i - x_{\sigma(i)}\|_2}    (20)

or, in a smoothed form,

\frac{y_i - x_{\sigma(i)}}{\sqrt{\|y_i - x_{\sigma(i)}\|_2^2 + \delta^2}}, \quad \delta \in [10^{-6}, 10^{-3}]    (21)
which avoids division by zero while converging to the true subgradient as δ 0 . In practice, because recomputing σ at each iteration is computationally expensive, our implementation relies on the finite-difference estimator instead of the analytic form.
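A sketch of this estimator is given below. For self-containment we stand in for the scalar W_1 oracle with an exact assignment-based computation via scipy.optimize.linear_sum_assignment (the paper’s implementation calls scipy.stats.wasserstein_distance_nd instead), and we assume a reference point set of the same dimension as the embedding (e.g., a 2-D projection of X); `w1_uniform` and `spsa_grad` are illustrative names:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w1_uniform(X, Y):
    """Exact W_1 between equal-size uniform empirical measures (assignment LP)."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(C)
    return float(C[rows, cols].mean())

def spsa_grad(loss, Y, delta=1e-3, n_avg=8, seed=0):
    """SPSA-style two-point estimate of d(loss)/dY from scalar evaluations only."""
    rng = np.random.default_rng(seed)
    g = np.zeros_like(Y)
    for _ in range(n_avg):
        d = rng.choice([-1.0, 1.0], size=Y.shape)          # Rademacher directions
        g += (loss(Y + delta * d) - loss(Y - delta * d)) / (2.0 * delta) * d
    return g / n_avg
```

Averaging over `n_avg` random perturbations reduces the variance of the estimate; only two loss evaluations per perturbation are needed, which is why the scalar-only wasserstein_distance_nd output suffices in the paper’s setup.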
The KL and Wasserstein terms contribute complementary geometric information to the embedding. The divergence KL(P‖Q) induces short-range attractive forces that preserve local neighborhood relations, becoming increasingly influential as clusters form. In contrast, the Wasserstein component provides global structural guidance by encouraging each embedded point y_i to align with its high-dimensional counterpart x_{σ(i)}.
To illustrate their interaction, consider the gradient of the combined objective,

\nabla_{y_i} \mathcal{L}_{\mathrm{KLW}} = \nabla_{y_i} \mathrm{KL}(P \| Q) + \beta \nabla_{y_i} W_1(\mu_X, \mu_Y)    (22)
The KL gradient takes the well-known t-SNE form,

\nabla_{y_i} \mathrm{KL}(P \| Q) = 4 \sum_{j \neq i} (P_{ij} - Q_{ij}) \frac{y_i - y_j}{1 + \|y_i - y_j\|^2}    (23)

which strengthens local attraction whenever P_ij > Q_ij.
In contrast, away from assignment changes, a valid subgradient of W_1 is

\nabla_{y_i} W_1(\mu_X, \mu_Y) = \frac{1}{n} \, \frac{y_i - x_{\sigma(i)}}{\|y_i - x_{\sigma(i)}\|_2}    (24)

which pulls each embedded point toward its high-dimensional origin. This long-range signal counteracts crowding and improves the global layout.
Although the two forces may point in different directions at times, their effects are complementary. The Wasserstein component typically dominates in the early iterations, promoting a globally coherent structure, while the KL term refines local neighborhoods as optimization progresses. Consequently, the combined objective achieves both improved global organization and enhanced local fidelity compared with standard t-SNE.

5. Experiments

This section evaluates the performance of the proposed methods, KL_max and L_KLW, across a variety of datasets. We organize the experiments as follows: (i) we first evaluate the impact of the weighting parameters α and β on the quality of the embeddings; (ii) we then examine the evolution of the loss function across optimization iterations; (iii) next, we compare the clustering and structure preservation performance of KL_max and L_KLW against the baseline using quantitative metrics; and (iv) finally, we provide qualitative assessments through 2D visualizations of the resulting embeddings. All experiments were conducted on a MacBook Pro (13-inch, M1, 2020) running macOS Sequoia 15.1 with an Apple M1 chip and 8 GB of RAM. These hardware specifications are provided to contextualize the computational performance of the proposed methods.

5.1. Datasets

We conducted experiments on a diverse collection of benchmark datasets, including Pendigits [24], MNIST [25], Fashion-MNIST [26], COIL-20 [27], and Olivetti Faces [28]. These datasets vary in terms of structure, modality, and complexity, allowing us to evaluate the effectiveness of our proposed methods under different distributional settings.
  • MNIST contains 70,000 handwritten digit images (28 × 28 pixels, grayscale), evenly distributed across 10 classes. It is widely used for evaluating clustering quality in embedding spaces.
  • Pendigits comprises 1797 instances of pen-based digit trajectories (8 × 8 grayscale images) representing digits 0 to 9, each assigned to one of 10 classes.
  • Fashion-MNIST includes 70,000 grayscale images (28 × 28 pixels) of clothing items from 10 fashion categories. Compared to MNIST, this dataset presents more subtle visual variations between classes.
  • Olivetti Faces consists of 400 grayscale facial images (64 × 64 pixels) of 40 individuals, with 10 different facial expressions or lighting conditions per subject. This introduces intra-class variation and is suitable for testing embedding robustness.
  • COIL-20 contains 1440 grayscale images of 20 objects taken from 72 different viewpoints (5-degree increments), introducing continuous rotational changes. Successful embeddings must preserve the geometric progression of these views.
Synthetic datasets such as Swiss Roll and Three Gaussians provide controlled environments for evaluating structure preservation, whereas real-world datasets present more complex distributions. MNIST, Pendigits, and Fashion-MNIST, composed of grayscale images with well-separated categories, serve as benchmarks for cluster fidelity. Olivetti Faces challenges embeddings with variation in lighting and expression, while COIL-20 emphasizes the need for preserving smooth geometric transformations.

5.2. Evaluation Metrics

We assess the performance of the K-means algorithm across various embedding spaces using both external and internal validation metrics. The external metrics include the F1-score [29,30] and Normalized Mutual Information (NMI) [31], both of which yield scores from 0 to 1, where higher values indicate better alignment with ground truth labels. For internal validation, we use the Davies–Bouldin Index (DBI) [32], which quantifies cluster compactness and separation. Lower DBI values indicate better clustering performance.
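Of these metrics, the DBI is the only label-free one and is central to the structural analysis below. As a reference for how it is computed, here is a self-contained NumPy sketch (sklearn.metrics.davies_bouldin_score is the standard implementation; the function name here is ours):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: mean over clusters i of the worst-case ratio
    (S_i + S_j) / M_ij, where S_i is the mean distance of cluster i's points
    to its centroid and M_ij is the distance between centroids. Lower is better."""
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # Per-cluster scatter S_i
    S = np.array([np.mean(np.linalg.norm(X[labels == k] - c, axis=1))
                  for k, c in zip(ks, centroids)])
    # Pairwise centroid separations M_ij
    M = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    R = (S[:, None] + S[None, :]) / np.where(M > 0, M, np.inf)
    np.fill_diagonal(R, -np.inf)          # exclude i == j from the max
    return float(np.mean(R.max(axis=1)))
```

Tight, well-separated clusters drive S_i down and M_ij up, hence a small index; overlapping clusters inflate the ratio.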

5.3. Parameter Sensitivity

To analyze the influence of parameter weighting in our proposed loss functions, we conducted a sensitivity analysis of α (for KL_max) and β (for L_KLW) on two datasets: Pendigits and COIL-20. We evaluated both quantitative results using NMI, F1-score, and DBI, as well as qualitative 2D visualizations of the embeddings.
While the full sensitivity sweep of α (for KL_max) and β (for L_KLW) was performed on Pendigits and COIL-20, these two datasets were selected because they represent complementary geometric regimes: discrete, well-separated clusters and smooth manifold structure. These regimes are known to generalize the behavior of neighbor-embedding objectives on high-dimensional image datasets such as MNIST and Fashion-MNIST. In practice, we found that applying the same parameter ranges (α ∈ [0.5, 0.7] and β ∈ [250, 500]) to the remaining datasets produced stable embeddings without degradation, suggesting that these values are reasonably robust across datasets.

5.3.1. Effect on NMI and F1-Score

Figure 1 and Figure 2 illustrate how NMI and F1-score change under varying α and β values. For the Pendigits dataset, NMI initially increases slightly as α moves from 0.2 to 0.5, then drops sharply beyond α = 0.7. This suggests that a moderate influence from KL_max enhances class separability, while excessive emphasis leads to embedding degradation.
Figure 1. Effect of alpha and beta on performance based on Pendigits dataset.
Figure 2. Effect of alpha and beta on performance based on COIL-20 dataset.
A similar trend is observed with β: NMI and F1 peak at β = 250–500, before degrading at β = 1000. These results confirm that both loss functions require careful parameter tuning to balance structural preservation and noise amplification. In the COIL-20 dataset, optimal NMI and F1 are reached at α = 0.7 and β = 250. However, performance drops at higher values, indicating over-regularization may suppress the dataset’s inherent geometric continuity.

5.3.2. Effect on DBI and Interpretation

Interestingly, the DBI values for COIL-20 remain nearly constant across parameter changes, even when NMI and F1 degrade significantly (see Figure 2). This suggests that while the semantic alignment with class labels is affected, the geometric structure of the clusters remains stable. This is a direct result of the DBI’s nature. It evaluates clustering quality based on intra-cluster compactness and inter-cluster separation, without relying on ground truth labels.
In contrast, for Pendigits (Figure 1), DBI trends align closely with NMI and F1. Since Pendigits consists of discrete digit classes, degradation in the embedding affects both structural and label-based clustering quality.
These differences highlight that DBI is more suitable for assessing structure-aware embeddings, particularly in datasets like COIL-20, where the class boundaries follow a smooth, continuous transformation (e.g., object rotation). The model may preserve structural integrity (low DBI) even as class labels become less distinguishable (low NMI/F1), especially under high α or β values.

5.3.3. Effect on Embedding Structure

These trends are reflected in the visualizations in Figure 3 and Figure 4. In Pendigits (Figure 3), increasing α initially improves cluster tightness, but at α = 0.7 and above, the layout collapses or degenerates into scattered blobs. Similar degradation is observed for large β values.
Figure 3. The effect of varying the α and β parameters on the Pendigits dataset, visualized for KL_max (Row 1) and L_KLW (Row 2).
Figure 4. The effect of varying the α and β parameters on the COIL-20 dataset, visualized for KL_max (Row 1) and L_KLW (Row 2).
For COIL-20 (Figure 4), low α and β values produce visually clean and well-separated loops corresponding to object rotations. At higher values, the embeddings become tangled, yet the overall shape and spacing between clusters are partially preserved, explaining why DBI remains low despite declining NMI and F1.

5.3.4. Summary

Moderate parameter values (α = 0.5–0.7, β = 250–500) consistently yield the best trade-off between local fidelity and global structure, in terms of both external (NMI, F1) and internal (DBI) metrics. The near-constant DBI observed in COIL-20 underscores that structural cluster quality can remain high even when class-label agreement is weakened, an important insight for applications involving manifold or rotational structures. These findings reinforce the value of DBI as a robust structure-preservation metric and highlight the importance of understanding dataset-specific geometry when tuning model parameters.

5.4. Convergence Analysis

Despite the slightly higher loss values observed in the Kernel Density Estimation (KDE) plots in Figure 5, these increases are expected due to the inclusion of additional structural terms in the proposed losses. It is important to emphasize that higher numerical loss does not imply degraded embedding quality. Instead, the added terms in KL_max and L_KLW help guide the optimization toward embeddings that are more structurally meaningful. This is validated by the consistently lower DBI scores and stronger clustering metrics (e.g., NMI and F1-score) achieved by our methods across all datasets.
Figure 5. Kernel Density Estimation (KDE) plots illustrating the distribution of loss values based on 50 independent runs for each method. From left to right: KL, KL_max, and L_KLW. From top to bottom: Pendigits, 5 classes from 5000 digits of MNIST, COIL-20, and FMNIST datasets.
The KDE plots, based on 50 independent runs per method and dataset, reveal that both KL_max and L_KLW exhibit tighter and more symmetric loss distributions compared to the standard KL divergence. This reflects greater optimization stability and reduced sensitivity to random initialization, suggesting that the proposed methods converge more consistently to robust minima.
To statistically verify these observations, we employed two non-parametric tests—the Kolmogorov–Smirnov (KS) test and the Mann–Whitney U test—because they make minimal distributional assumptions, are robust to outliers and skew, and provide complementary sensitivity: KS detects any distributional change (location, scale, or shape), while Mann–Whitney focuses on differences in central tendency dominance between groups [33,34]. The results, reported in Table 1, show that for all dataset comparisons, the p-values are below the significance threshold of 0.05. This confirms that the differences in loss distributions between KL and our proposed losses are statistically significant.
Table 1. Statistical comparison of loss distributions across KL, KL_max, and L_KLW using KS and Mann–Whitney U tests (p-values). The reported p-values appear as 0.0000 because they were rounded to four decimal places in the output; all are smaller than 0.0001 (<0.0001).
These findings further strengthen our conclusion that KL_max and L_KLW not only improve the quality of low-dimensional embeddings in terms of structure preservation and cluster separation, but also provide more stable and reproducible optimization behavior.
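The statistical comparison above can be sketched directly with scipy.stats (the helper name and the dictionary format are ours; the inputs are the per-run final loss values, e.g., 50 per method):

```python
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu

def compare_loss_runs(losses_a, losses_b, alpha=0.05):
    """Compare two samples of final loss values with the two-sample KS test
    (sensitive to any distributional difference) and the Mann-Whitney U test
    (sensitive to location/dominance shifts)."""
    ks = ks_2samp(losses_a, losses_b)
    mw = mannwhitneyu(losses_a, losses_b, alternative='two-sided')
    return {
        'ks_p': float(ks.pvalue),
        'mw_p': float(mw.pvalue),
        'significant': bool(ks.pvalue < alpha and mw.pvalue < alpha),
    }
```

Requiring both tests to fall below the threshold mirrors the complementary-sensitivity rationale given above: KS flags shape or scale changes, Mann–Whitney flags shifts in central tendency.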

5.5. Computational Complexity Analysis

To evaluate the practical feasibility of the proposed loss functions, we measured the total execution time required to optimize the embeddings on four benchmark datasets. Table 2 summarizes the runtime (in seconds) of standard t-SNE with the KL divergence and the modified versions using KL_max and L_KLW.
Table 2. Execution time (in seconds) of each method on the benchmark datasets.
As expected, both KL_max and L_KLW introduce additional computational overhead compared to the baseline. This increase stems from the extra components in the objective function: the ranking-deviation term in KL_max and the Wasserstein term in L_KLW. However, this added cost is relatively modest.
Importantly, the improvements in embedding quality, reflected in superior clustering scores, lower DBI values, and increased visual interpretability, justify the modest increase in runtime. The structure-aware regularization terms provide better guidance during optimization, resulting in embeddings that are both more informative and more stable, without incurring prohibitive computational costs.

5.6. Comprehensive Evaluation of Clustering, Structure, and Reconstruction

This section evaluates the proposed divergence measures, KL_max and L_KLW, against the standard KL divergence in terms of clustering quality, structural preservation, and reconstruction accuracy. We apply the methods across six benchmark datasets with varying characteristics, including synthetic and real-world data.

5.6.1. Clustering Metrics

As shown in Table 3, both proposed losses outperform the standard KL divergence in terms of F1-score and NMI across most datasets. On simpler datasets such as 3-Gaussians, all methods perform comparably. However, on more challenging datasets like Pendigits, MNIST, and COIL-20, KL_max and L_KLW achieve clear improvements. For instance, on Pendigits, L_KLW improves the NMI from 0.833 to 0.872 and the F1-score from 0.852 to 0.908. These enhancements demonstrate better alignment between the embedded space and the underlying class structure.
Table 3. Comparison of metrics for KL, KL_max, and L_KLW.

5.6.2. Internal Clustering Quality

The DBI is used to assess cluster compactness and separation. As shown in Table 3, both KL_max and L_KLW generally yield lower DBI scores than the baseline KL, indicating tighter and more distinct clusters. For instance, on the COIL-20 dataset, the DBI decreases from 0.637 (KL) to 0.600 (L_KLW), reflecting improved structural grouping. However, the slightly lower DBI of the baseline on Fashion-MNIST can be attributed to that dataset’s simple, compact cluster structure: its local neighborhoods are already well defined, favoring methods that prioritize local cohesion. In contrast, our proposed methods emphasize global geometric alignment, so a small trade-off in intra-cluster compactness is expected.

5.6.3. Visual Inspection of Embeddings

Figure 6 provides qualitative comparisons of the 2D embeddings produced by each method. Embeddings generated using L_KLW show more distinct and well-separated clusters, especially in datasets with complex manifolds like Swiss Roll and COIL-20. The consistency and continuity in structure indicate that L_KLW effectively captures long-range relationships in the data. Meanwhile, KL_max produces embeddings with sharper cluster boundaries and reduced within-cluster variance, showing improvements over the standard KL divergence in terms of local structure.
Figure 6. Visual comparison of embeddings using, from left to right, KL, KL_max, and L_KLW; and, from top to bottom, 3-Gaussian clusters, Swiss Roll, Pendigits, 5 classes of 5000 MNIST digits, COIL-20, and FMNIST datasets. In all panels, point colors indicate the ground-truth class or cluster label in the corresponding dataset (e.g., Gaussian component, roll arm, digit identity, or object category).

5.6.4. Out-of-Sample and Reconstruction Quality

To address the out-of-sample limitation inherent in t-SNE and its variants, a Multi-Layer Perceptron (MLP) regressor was employed to approximate the mapping from high-dimensional input data to their corresponding low-dimensional embeddings. Separate models were trained using embeddings generated by t-SNE with K L , K L max , and L K L W loss functions. Once trained, the MLP enables fast and efficient projection of new, unseen data points into the embedding space without rerunning the full optimization procedure, making the method more scalable and practical for dynamic or streaming scenarios.
The quality of these learned mappings was evaluated using the MSE between the predicted embeddings and the original embeddings produced by each method. As shown in Table 4, both K L max and L K L W achieve lower MSE than the standard K L -based t-SNE, with K L max achieving the best reconstruction performance (MSE = 0.0014). This indicates stronger generalization and more faithful preservation of the underlying data structure. The standard asymmetric K L ( P ∥ Q ) primarily penalizes missed neighbors and can tolerate false neighbors, leading to crowding and global distortions, whereas K L max penalizes discrepancies in both directions, reducing spurious attractions and better aligning local and global relationships. In addition, L K L W = K L + λ W augments the objective with a Wasserstein term that encourages globally coherent layouts, further lowering the reconstruction error.
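The out-of-sample strategy described above can be sketched as follows. The data, network size, and training settings are illustrative assumptions, not the paper's exact configuration; the key idea is only that a regressor learns the map from high-dimensional inputs to precomputed 2D embeddings.

```python
# Hedged sketch: learn an out-of-sample mapping with an MLP regressor, then
# project unseen points with a single forward pass (no re-optimization).
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))                    # high-dimensional inputs
Y = X[:, :2] + 0.05 * rng.normal(size=(400, 2))   # stand-in for a 2D embedding

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
mlp.fit(X_tr, Y_tr)                               # fit input -> embedding map

# Held-out MSE between predicted and reference embedding coordinates.
mse = mean_squared_error(Y_te, mlp.predict(X_te))
print(f"held-out MSE: {mse:.4f}")
```

In practice `Y` would be the embedding produced by t-SNE under a given loss, and the held-out MSE plays the role of the reconstruction metric reported in Table 4.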
Table 4. Average MSE of Reconstructed Faces in the Olivetti Dataset.
To further assess the preservation of structural and semantic features, a reconstruction experiment was conducted using the Olivetti Faces dataset. Figure 7 shows that the embeddings derived from K L max and L K L W allow for the recovery of faces with clearer contours, sharper expressions, and reduced artifacts. In contrast, reconstructions from K L -based embeddings suffer from noticeable blurring and loss of detail. These qualitative and quantitative results confirm the superiority of the proposed loss functions in preserving essential information for both embedding quality and out-of-sample extension.
Figure 7. Reconstruction of Olivetti Faces using K L , K L max and L K L W , with MSE values shown for each method. Images from the AT&T Laboratories Cambridge Database of Faces; used with attribution [28].

6. Advantages, Limitations, and Applications

6.1. Advantages

The proposed loss functions, K L max and L K L W , offer several advantages over the K L divergence in the context of dimensionality reduction and data visualization:
  • Improved Similarity Ranking: The introduction of K L max enhances the model’s ability to preserve similarity rankings, making the embeddings more consistent with the underlying class structure.
  • Better Reconstruction Quality: Both loss functions reduce reconstruction error (as reflected in lower MSE values), and the visual results demonstrate sharper and more accurate data recovery.
  • Robustness Across Datasets: Empirical results show consistent performance gains on a variety of datasets, including structured (e.g., Swiss Roll) and real-world datasets (e.g., MNIST, Olivetti Faces).

6.2. Limitations

Despite the demonstrated benefits, the proposed approaches also have certain limitations:
  • Increased Computational Complexity: Integrating Wasserstein distance or ranking-based terms into the loss functions introduces additional computational overhead relative to standard t-SNE. Specifically, K L max adds O ( N 2 ) pairwise comparisons with an additional ranking step that marginally increases runtime (about 10%), while L K L W incorporates a batch-wise Wasserstein distance term with complexity O ( m b log b ) , where m is the number of random directions and b is the batch size.
  • Hyperparameter Sensitivity: Parameters such as the exaggeration factor or weighting coefficients require tuning, which may limit generalizability without cross-validation.
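The batch-wise Wasserstein term with complexity O ( m b log b ) can be realized as a sliced approximation: project both batches onto m random unit directions and average the 1D Wasserstein distances, each obtained by sorting. The implementation below is a minimal sketch of that idea, with illustrative names and data, not the paper's exact code.

```python
# Hedged sketch: sliced Wasserstein distance between equal-size batches.
# Cost: m projections, each sorted in O(b log b) -> O(m * b log b) overall.
import numpy as np

def sliced_wasserstein(a, b, m=50, seed=0):
    """Approximate W1 between batches a, b of equal shape (batch, dim)."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(m, a.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    total = 0.0
    for u in dirs:
        pa, pb = np.sort(a @ u), np.sort(b @ u)          # 1D projections
        total += np.mean(np.abs(pa - pb))                # 1D W1 via sorting
    return total / m

rng = np.random.default_rng(1)
x = rng.normal(size=(128, 10))
print(sliced_wasserstein(x, x))        # identical batches -> 0
print(sliced_wasserstein(x, x + 2.0))  # shifted batch -> strictly positive
```

Sorting replaces the full optimal-transport solve in 1D, which is what keeps the per-batch overhead modest relative to the O ( N 2 ) pairwise terms.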

6.3. Applications

The proposed loss functions and enhancements to t-SNE are well-suited for many applications:
  • Bioinformatics and Genomics: For interpreting gene expression data or cell populations where subtle structures (e.g., differentiation trajectories) matter.
  • Facial Recognition and Image Retrieval: Improved clustering and reconstruction quality make these methods effective for feature extraction and identity preservation.
  • Out-of-Sample Generalization: Traditional t-SNE lacks an explicit mapping function, so incorporating new data points requires re-running the full embedding process, which is computationally expensive and unsuitable for dynamic or streaming environments. Pairing the proposed losses with a learned regression mapping, as in Section 5.6.4, extends the embeddings to unseen data efficiently and makes the approach practical in such settings.

7. Discussion

This work set out to address two well-known limitations of t-SNE’s K L -based objective, namely its emphasis on local neighborhoods at the expense of global structure and its asymmetric sensitivity, by introducing K L max and the L K L W loss. Across synthetic and real-world datasets, our experiments show that both objectives improve cluster separability and structural clarity relative to standard K L , with L K L W most consistently enhancing long-range organization and K L max sharpening local neighborhoods. These trends are reflected quantitatively (higher NMI/F1 and lower or comparable DBI) and qualitatively in the 2D embeddings. Notably, modest increases in aggregate loss values for the proposed objectives do not translate to degraded embeddings; instead, they arise from additional structure-aware terms that guide the optimizer toward configurations that better respect the data geometry.
Our findings align with and extend prior observations that alternative divergences or geometry-aware criteria can remedy K L ’s biases in neighbor embedding. By explicitly penalizing rank deviations ( K L max ) and by incorporating transport geometry ( L K L W ), the proposed losses reconcile local fidelity with broader organization. In particular, L K L W leverages Wasserstein’s smooth, geometry-aware transport to better capture manifold continuity (e.g., COIL-20 rotations), while K L max reduces within-cluster variance by enforcing consistency among the highest-similarity relationships. The resulting embeddings exhibit clearer global layouts without sacrificing the hallmark local detail that makes t-SNE appealing.
A practical contribution of this study is guidance on hyperparameter regimes. Sensitivity analyses indicate that moderate values, α ≈ 0.5–0.7 for K L max and β ≈ 250–500 for L K L W , strike a robust balance between local ranking fidelity and global structure, with performance typically degrading if either term dominates the objective. Interestingly, COIL-20’s DBI remains relatively stable even when NMI/F1 decline at high α or β , suggesting that structure-only metrics can under- or over-estimate semantic separability when manifold continuity drives class organization. This divergence between internal and external indices highlights the importance of reporting both metric families and of visually inspecting embeddings when the data have strong geometric factors.
From an optimization perspective, both K L max and L K L W demonstrate tighter, more symmetric loss distributions over repeated runs and statistically significant differences relative to the K L baseline (KS and Mann–Whitney U tests, p < 0.05 across datasets), indicating improved robustness to initialization. While the added terms introduce a modest runtime overhead, the computational cost remains practical on commodity hardware and is offset by gains in stability and interpretability. Moreover, out-of-sample mapping via a simple MLP shows lower reconstruction MSE, particularly for K L max , suggesting that the learned embeddings expose structure that is easier to approximate with smooth predictors, a property valuable for streaming or incremental use cases.
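The stability comparison above reduces to two-sample tests on final loss values from repeated runs. The sketch below shows that protocol with SciPy; the arrays are synthetic stand-ins for the per-run losses, not the paper's measurements.

```python
# Hedged sketch: KS and Mann-Whitney U tests comparing per-run final losses
# of a baseline against a proposed objective. Data below are synthetic.
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu

rng = np.random.default_rng(0)
losses_kl = rng.normal(loc=1.00, scale=0.10, size=30)   # baseline runs
losses_new = rng.normal(loc=0.85, scale=0.04, size=30)  # tighter distribution

ks_stat, ks_p = ks_2samp(losses_kl, losses_new)         # distribution shape/shift
u_stat, u_p = mannwhitneyu(losses_kl, losses_new)       # rank-based location test
print(f"KS p={ks_p:.4g}, Mann-Whitney p={u_p:.4g}")
```

Both tests are nonparametric, so they make no normality assumption about the loss distributions, which is appropriate for optimizer outcomes across random initializations.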
Limitations remain; for example, the methods introduce additional hyperparameters and, for L K L W , require approximations to make transport terms tractable at scale. Although our experiments cover diverse datasets, broader evaluations—e.g., high-throughput single-cell data or large-vocabulary text embeddings—could further validate generality. In addition, while we observed that higher composite losses need not imply worse visualization quality, a deeper theoretical characterization of the proposed objectives’ minima and their relationship to cluster topology would be valuable.
For future work, several avenues follow naturally: (i) combine K L m a x and L K L W into a single objective to jointly enforce ranking and transport geometry, potentially with adaptive weighting during training; (ii) explore scalable transport surrogates (e.g., sliced or linearized optimal transport) tailored to neighbor-embedding kernels to further reduce overhead; (iii) integrate label information in a semi-supervised variant to explicitly balance semantic and geometric structure; (iv) develop principled, data-dependent schedules for α and β based on online diagnostics (e.g., changes in DBI vs. NMI), reducing manual tuning; and (v) extend the out-of-sample mapping with uncertainty estimation to flag projections likely to deviate from the manifold learned by the embedding.

Author Contributions

Conceptualization, R.H. and S.B.B.; methodology, R.H. and S.B.B.; software, S.N.; validation, S.N., R.H. and S.B.B.; formal analysis, S.N.; investigation, S.N.; resources, S.N.; data curation, S.N.; writing original draft preparation, S.N.; reviewing and editing, R.H. and S.B.B.; supervision, R.H. and S.B.B.; project administration, R.H. and S.B.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Qatar National Research Fund (QNRF), grant number GSRA9-L-1-0512-22011.

Data Availability Statement

All datasets used in this study are publicly available. The MNIST dataset is accessible at https://www.kaggle.com/datasets/hojjatk/mnist-dataset. The Pendigits dataset can be obtained from the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/81/pen+based+recognition+of+handwritten+digits. The Fashion-MNIST dataset is available at https://github.com/zalandoresearch/fashion-mnist. The Olivetti Faces (ORL) dataset is available at https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html. The COIL-20 dataset can be accessed at https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572. [Google Scholar] [CrossRef]
  2. McInnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
  3. van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  4. Wan, N.; Li, D.; Hovakimyan, N. F-divergence variational inference. Adv. Neural Inf. Process. Syst. 2020, 33, 17370–17379. [Google Scholar]
  5. Arora, S.; Hu, W.; Kothari, P.K. An analysis of the t-SNE algorithm for data visualization. In Proceedings of the Conference on Learning Theory, Stockholm, Sweden, 6–9 July 2018; PMLR: Cambridge, MA, USA, 2018; pp. 1455–1462. [Google Scholar]
  6. Im, D.J.; Verma, N.; Branson, K. Stochastic neighbor embedding under f-divergences. arXiv 2018, arXiv:1811.01247. [Google Scholar] [CrossRef]
  7. Yao, W.; Yang, W.; Wang, Z.; Lin, Y.; Liu, Y. Revisiting weak-to-strong generalization in theory and practice: Reverse KL vs. Forward KL. arXiv 2025, arXiv:2502.11107. [Google Scholar]
  8. Yang, Z.; Chen, Y.; Corander, J. t-SNE is not optimized to reveal clusters in data. arXiv 2021, arXiv:2110.02573. [Google Scholar] [CrossRef]
  9. Naderializadeh, N.; Li, R.; Xiao, D.; Shrivastava, A.; Soatto, R. Set representation learning with generalized sliced-Wasserstein embeddings. arXiv 2021, arXiv:2103.03892. [Google Scholar] [CrossRef]
  10. Kolouri, S.; Park, S.R.; Thorpe, M.; Slepcev, D.; Rohde, G.K. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Process. Mag. 2017, 34, 43–59. [Google Scholar] [CrossRef]
  11. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst. 2013, 26, 2292–2300. [Google Scholar]
  12. Genevay, A.; Cuturi, M.; Peyré, G.; Bach, F. Stochastic optimization for large-scale optimal transport. Adv. Neural Inf. Process. Syst. 2016, 29, 3440–3448. [Google Scholar]
  13. Seguy, V.; Damodaran, B.B.; Flamary, R.; Courty, N.; Rolet, A.; Blondel, M. Large-scale optimal transport and mapping estimation. arXiv 2017, arXiv:1711.02283. [Google Scholar]
  14. Gallouët, T.; Ghezzi, R.; Vialard, F.X. Regularity theory and geometry of unbalanced optimal transport. J. Funct. Anal. 2025, 289, 111042. [Google Scholar] [CrossRef]
  15. Liu, Y.; Wang, C.; Lu, M.; Yang, J.; Gui, J.; Zhang, S. From simple to complex scenes: Learning robust feature representations for accurate human parsing. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5449–5462. [Google Scholar] [CrossRef] [PubMed]
  16. Gao, Z.; Wang, Y.; Chen, Y.; Feng, X.; Ji, H. Light-field image multiple reversible robust watermarking against geometric attacks. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7457–7471. [Google Scholar] [CrossRef]
  17. Wang, Y.; Wan, R.; Wang, Q.; Chan, K.C.C.; Chen, H. Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMap, and PaCMAP for data visualization. J. Mach. Learn. Res. 2021, 22, 1–45. [Google Scholar]
  18. Zhou, Y.; Sharpee, T.O. Using global t-SNE to preserve intercluster data structure. Neural Comput. 2022, 34, 1637–1651. [Google Scholar] [CrossRef]
  19. Kobak, D.; Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 2019, 10, 5416. [Google Scholar] [CrossRef]
  20. Villani, C. Optimal Transport: Old and New; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008; Volume 338. [Google Scholar]
  21. Brenier, Y. Polar factorization and monotone rearrangement of vector-valued functions. Commun. Pure Appl. Math. 1991, 44, 375–417. [Google Scholar] [CrossRef]
  22. Kantorovich, L.V. On the translocation of masses. Dokl. Akad. Nauk. USSR (NS) 1942, 37, 199–201. [Google Scholar] [CrossRef]
  23. Peyré, G.; Cuturi, M. Computational optimal transport: With applications to data science. Found. Trends Mach. Learn. 2019, 11, 355–607. [Google Scholar] [CrossRef]
  24. Alpaydin, E.; Alimoglu, F. Pen-based recognition of handwritten digits data set. Mach. Learn. Repos. 1998, 4, 2. [Google Scholar]
  25. LeCun, Y. The MNIST Database of Handwritten Digits. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 18 November 2025).
  26. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar] [CrossRef]
  27. Nene, S.A.; Nayar, S.K.; Murase, H. Columbia Object Image Library (COIL-20); Tech. Rep.; Columbia University: New York, NY, USA, 1996; pp. 223–303. [Google Scholar]
  28. AT&T Laboratories Cambridge. The Database of Faces (Olivetti/ORL Faces); AT&T Laboratories Cambridge: Cambridge, UK, 1994. [Google Scholar]
  29. Van Rijsbergen, C. Information retrieval: Theory and practice. In Proceedings of the Joint IBM/University of Newcastle upon Tyne Seminar on Data Base Systems, Suita, Japan, 4–7 September 1979; Volume 79, pp. 1–14. [Google Scholar]
  30. Powers, D.M.W. What the F-measure doesn’t measure: Features, flaws, fallacies and fixes. arXiv 2015, arXiv:1503.06410. [Google Scholar]
  31. Strehl, A.; Ghosh, J. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2002, 3, 583–617. [Google Scholar]
  32. Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227. [Google Scholar] [CrossRef]
  33. Massey, F.J., Jr. The Kolmogorov–Smirnov Test for Goodness of Fit. J. Am. Stat. Assoc. 1951, 46, 68–78. [Google Scholar] [CrossRef]
  34. Mann, H.B.; Whitney, D.R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 1947, 18, 50–60. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
