# Improving Dimensionality Reduction Projections for Data Visualization


## Abstract


## 1. Introduction

#### 1.1. Motivation

#### 1.2. Related Work

#### 1.3. Contributions

Our main contributions are the following:

- A new method for high-dimensional vector manipulation that improves a wide range of DR algorithms.
- A validation study demonstrating that our technique enhances the results in many scenarios.
- Data visualization examples that use document embeddings and provide evidence that the technique also works with other kinds of high-dimensional data.

## 2. Materials and Methods

#### 2.1. Background

#### 2.2. Vector Manipulation

**Inverse Document Frequency** (IDF), introduced as term specificity by Spärck Jones [32,33], is a common concept in document processing, and it is defined as:

$$\mathrm{idf}(t) = \log \frac{N}{n_t},$$

where $N$ is the total number of documents in the corpus and $n_t$ is the number of documents that contain the term $t$.
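As a sketch, the standard IDF weight can be computed directly from document frequencies. The toy corpus below is a hypothetical example; it illustrates the weighting itself, not the paper's vector-manipulation algorithm:

```python
import math

# Toy corpus (hypothetical): each document is represented by its set of terms.
docs = [
    {"graph", "render", "shadow"},
    {"graph", "cluster", "embed"},
    {"shadow", "light", "render"},
]
N = len(docs)

def idf(term):
    # Number of documents that contain the term.
    n_t = sum(term in doc for doc in docs)
    return math.log(N / n_t)

# "graph" appears in 2 of the 3 documents: idf = log(3/2) ≈ 0.405.
# "cluster" appears in only 1 document: idf = log(3) ≈ 1.099.
```

Rarer terms receive larger weights, which is what makes IDF useful for emphasizing discriminative dimensions.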

## 3. Validation

#### 3.1. Datasets

#### 3.2. Experiment Setup

- For each dataset $d$, a transformation is applied using our algorithm, giving the transformed dataset ${d}_{t}$.
- We project both $d$ and ${d}_{t}$ using PaCMAP, tSNE, trimap, and UMAP.
- Each projected set is evaluated five times with a linear SVM, and the results are averaged.
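The steps above can be sketched as follows. This is a minimal illustration using scikit-learn's t-SNE and a linear SVM on a small stand-in dataset; the `transform` placeholder is an assumption (plain L2 normalization), not the paper's algorithm, and the other DR methods (PaCMAP, trimap, UMAP) plug in the same way via their respective packages:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Small stand-in dataset; the paper's datasets (MNIST, Coil20, ...) are larger.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

def transform(data):
    # Placeholder for the paper's vector-manipulation step (assumption):
    # here we simply L2-normalize each sample.
    return data / (np.linalg.norm(data, axis=1, keepdims=True) + 1e-12)

def evaluate(data, labels, runs=2):
    # Project to 2D, then score a linear SVM; the paper averages five runs.
    scores = []
    for seed in range(runs):
        proj = TSNE(n_components=2, random_state=seed).fit_transform(data)
        scores.append(cross_val_score(LinearSVC(), proj, labels, cv=3).mean())
    return float(np.mean(scores))

acc_original = evaluate(X, y)
acc_transformed = evaluate(transform(X), y)
```

Comparing `acc_original` against `acc_transformed` across datasets and DR methods yields tables of the kind reported in Section 3.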

## 4. Document Visualization

#### 4.1. Document Embeddings

#### 4.1.1. Doc2vec Model Training

#### 4.1.2. Synthetic Dataset Creation

#### 4.1.3. Data Preprocessing

#### 4.1.4. Clusters’ Quality Analysis

## 5. Conclusions and Future Work

#### 5.1. Conclusions

#### 5.2. Discussion

#### 5.3. Future Work

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
DR | Dimensionality Reduction |
DT | Decision Trees |
KNN | K-Nearest Neighbors |
MLP | Multilayer Perceptron |
PaCMAP | Pairwise Controlled Manifold Approximation Projection |
SVM | Support Vector Machines |
trimap | Large-scale Dimensionality Reduction Using Triplets |
t-SNE | t-Distributed Stochastic Neighbor Embedding |
UMAP | Uniform Manifold Approximation and Projection |
XGBoost | Extreme Gradient Boosting |

## References

- Wang, Y.; Huang, H.; Rudin, C.; Shaposhnik, Y. Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMAP, and PaCMAP for data visualization. J. Mach. Learn. Res. **2021**, 22, 9129–9201.
- Hinterreiter, A.; Steinparz, C.; Schöfl, M.; Stitz, H.; Streit, M. Projection path explorer: Exploring visual patterns in projected decision-making paths. ACM Trans. Interact. Intell. Syst. **2021**, 11, 22.
- Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.A.; Kwok, I.W.; Ng, L.G.; Ginhoux, F.; Newell, E.W. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. **2019**, 37, 38–44.
- Vlachos, M.; Domeniconi, C.; Gunopulos, D.; Kollios, G.; Koudas, N. Non-linear dimensionality reduction techniques for classification and visualization. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; pp. 645–651.
- Cunningham, J.P.; Ghahramani, Z. Linear dimensionality reduction: Survey, insights, and generalizations. J. Mach. Learn. Res. **2015**, 16, 2859–2900.
- Lee, J.A.; Verleysen, M. Nonlinear Dimensionality Reduction; Springer: Berlin/Heidelberg, Germany, 2007; Volume 1.
- Ayesha, S.; Hanif, M.K.; Talib, R. Overview and comparative study of dimensionality reduction techniques for high dimensional data. Inf. Fusion **2020**, 59, 44–58.
- Sorzano, C.O.S.; Vargas, J.; Montano, A.P. A survey of dimensionality reduction techniques. arXiv **2014**, arXiv:1403.2877.
- Engel, D.; Hüttenberger, L.; Hamann, B. A survey of dimension reduction methods for high-dimensional data analysis and visualization. In Proceedings of the Visualization of Large and Unstructured Data Sets: Applications in Geospatial Planning, Modeling and Engineering-Proceedings of IRTG 1131 Workshop, Kaiserslautern, Germany, 10–11 June 2011; Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik: Wadern, Germany, 2012.
- Van Der Maaten, L.; Postma, E.; Van den Herik, J. Dimensionality reduction: A comparative. J. Mach. Learn. Res. **2009**, 10, 66–71.
- Sedlmair, M.; Brehmer, M.; Ingram, S.; Munzner, T. Dimensionality Reduction in the Wild: Gaps and Guidance; Tech. Rep. TR-2012-03; Department of Computer Science, University of British Columbia: Vancouver, BC, Canada, 2012.
- Huang, H.; Wang, Y.; Rudin, C.; Browne, E.P. Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization. Commun. Biol. **2022**, 5, 719.
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. **2008**, 9, 2579–2605.
- Wattenberg, M.; Viégas, F.; Johnson, I. How to use t-SNE effectively. Distill **2016**, 1, e2.
- Caillou, P.; Renault, J.; Fekete, J.D.; Letournel, A.C.; Sebag, M. Cartolabe: A web-based scalable visualization of large document collections. IEEE Comput. Graph. Appl. **2020**, 41, 76–88.
- McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv **2020**, arXiv:stat.ML/1802.03426.
- Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Reykjavik, Iceland, 22–25 April 2014; pp. 1188–1196.
- Kenton, J.D.M.W.C.; Toutanova, L.K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
- Silva, D.; Bacao, F. MapIntel: Enhancing Competitive Intelligence Acquisition through Embeddings and Visual Analytics. In Proceedings of the EPIA Conference on Artificial Intelligence, Lisbon, Portugal, 31 August–2 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 599–610.
- Abdullah, S.S.; Rostamzadeh, N.; Sedig, K.; Garg, A.X.; McArthur, E. Visual analytics for dimension reduction and cluster analysis of high dimensional electronic health records. Informatics **2020**, 7, 17.
- Humer, C.; Heberle, H.; Montanari, F.; Wolf, T.; Huber, F.; Henderson, R.; Heinrich, J.; Streit, M. ChemInformatics Model Explorer (CIME): Exploratory analysis of chemical model explanations. J. Cheminform. **2022**, 14, 21.
- Burch, M.; Kuipers, T.; Qian, C.; Zhou, F. Comparing dimensionality reductions for eye movement data. In Proceedings of the 13th International Symposium on Visual Information Communication and Interaction, Eindhoven, The Netherlands, 8–10 December 2020; pp. 1–5.
- Dorrity, M.W.; Saunders, L.M.; Queitsch, C.; Fields, S.; Trapnell, C. Dimensionality reduction by UMAP to visualize physical and genetic interactions. Nat. Commun. **2020**, 11, 1537.
- Tang, J.; Liu, J.; Zhang, M.; Mei, Q. Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web, Montreal, QC, Canada, 11–15 April 2016; pp. 287–297.
- Amid, E.; Warmuth, M.K. TriMap: Large-scale dimensionality reduction using triplets. arXiv **2019**, arXiv:1910.00204.
- Jeon, H.; Ko, H.K.; Lee, S.; Jo, J.; Seo, J. Uniform Manifold Approximation with Two-phase Optimization. In Proceedings of the 2022 IEEE Visualization and Visual Analytics (VIS), Oklahoma City, OK, USA, 16–21 October 2022; pp. 80–84.
- Sedlmair, M.; Munzner, T.; Tory, M. Empirical guidance on scatterplot and dimension reduction technique choices. IEEE Trans. Vis. Comput. Graph. **2013**, 19, 2634–2643.
- Espadoto, M.; Martins, R.M.; Kerren, A.; Hirata, N.S.; Telea, A.C. Toward a quantitative survey of dimension reduction techniques. IEEE Trans. Vis. Comput. Graph. **2019**, 27, 2153–2173.
- Olobatuyi, K.; Parker, M.R.; Ariyo, O. Cluster weighted model based on TSNE algorithm for high-dimensional data. Int. J. Data Sci. Anal. **2023**.
- Allaoui, M.; Kherfi, M.L.; Cheriet, A. Considerably improving clustering algorithms using UMAP dimensionality reduction technique: A comparative study. In Proceedings of the International Conference on Image and Signal Processing, Marrakesh, Morocco, 4–6 June 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 317–325.
- Church, K.; Gale, W. Inverse document frequency (idf): A measure of deviations from poisson. In Natural Language Processing Using Very Large Corpora; Springer: Berlin/Heidelberg, Germany, 1999; pp. 283–295.
- Sparck Jones, K. A statistical interpretation of term specificity and its application in retrieval. J. Doc. **1972**, 28, 11–21.
- Robertson, S. Understanding inverse document frequency: On theoretical arguments for IDF. J. Doc. **2004**, 60, 503–520.
- Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory **1967**, 13, 21–27.
- Quinlan, J.R. Induction of Decision Trees. Mach. Learn. **1986**, 1, 81–106.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; KDD’16. ACM: New York, NY, USA, 2016; pp. 785–794.
- Haykin, S. Neural Networks: A Comprehensive Foundation; Prentice Hall PTR: London, UK, 1994.
- LeCun, Y.; Cortes, C. The MNIST Database of Handwritten Digits. 1998. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 15 May 2023).
- Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv **2017**, arXiv:1708.07747.
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.toronto.edu/~kriz/ (accessed on 27 July 2023).
- Nene, S.A.; Nayar, S.K.; Murase, H. Columbia Object Image Library (Coil-20); Technical Report; Columbia University: New York, NY, USA, 1996.
- Reyes-Ortiz, J.; Anguita, D.; Ghio, A.; Oneto, L.; Parra, X. Human Activity Recognition Using Smartphones. UCI Mach. Learn. Repos. **2012**.
- Kotzias, D. Sentiment Labelled Sentences. UCI Mach. Learn. Repos. **2015**.
- Yuval, N. Reading digits in natural images with unsupervised feature learning. In Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain, 12–17 December 2011.
- Hull, J.J. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. **1994**, 16, 550–554.
- Sharan, L.; Rosenholtz, R.; Adelson, E. Material perception: What can you see in a brief glance? J. Vis. **2009**, 9, 784.
- Lang, K. 20 Newsgroups Dataset. Available online: https://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/news20.html (accessed on 15 May 2023).
- Cutura, R.; Holzer, S.; Aupetit, M.; Sedlmair, M. VisCoDeR: A tool for visually comparing dimensionality reduction algorithms. In Proceedings of the ESANN, Bruges, Belgium, 25–27 April 2018.
- Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; COLT’92. Association for Computing Machinery: New York, NY, USA, 1992; pp. 144–152.
- Chuang, J.; Ramage, D.; Manning, C.; Heer, J. Interpretation and trust: Designing model-driven visualizations for text analysis. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Austin, TX, USA, 5–10 May 2012; pp. 443–452.
- Landauer, T.K.; Laham, D.; Derr, M. From paragraph to graph: Latent semantic analysis for information visualization. Proc. Natl. Acad. Sci. USA **2004**, 101, 5214–5219.
- Kim, K.; Lee, J. Sentiment visualization and classification via semi-supervised nonlinear dimensionality reduction. Pattern Recognit. **2014**, 47, 758–768.
- Harris, Z.S. Distributional structure. Word **1954**, 10, 146–162.
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. **2003**, 3, 993–1022.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv **2013**, arXiv:1301.3781.
- Lo, K.; Wang, L.L.; Neumann, M.; Kinney, R.; Weld, D. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Toronto, ON, Canada, 2020; pp. 4969–4983.
- Alvarez, J.E.; Bast, H. A Review of Word Embedding and Document Similarity Algorithms Applied to Academic Text. Bachelor Thesis, University of Freiburg, Freiburg im Breisgau, Germany, 2017.
- Rehurek, R.; Sojka, P. Gensim–Python Framework for Vector Space Modelling; NLP Centre, Faculty of Informatics, Masaryk University: Brno, Czech Republic, 2011; Volume 3.
- Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009.
- Gómez, J.; Vázquez, P.P. An Empirical Evaluation of Document Embeddings and Similarity Metrics for Scientific Articles. Appl. Sci. **2022**, 12, 5664.
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv **2020**, arXiv:2004.05150.

**Figure 1.**Dimensionality reduction of the spheres’ dataset using the different DR algorithms. Note that one of the classes, identified with the pink color, tends to overlap with the other clusters in the different cases. Our modifications effectively disentangle a substantial portion of the data points, resulting in a clearer and improved data visualization.

**Figure 2.**Comparing other clustering algorithms for the PaCMAP DR algorithm. The results are equivalent to the ones obtained with SVM.

**Figure 3.**Evaluation of trimap clustering using KNN, DT, MLP, and XGBoost. As in the previous case, the results are analogous to the ones obtained with SVM.

**Figure 5.**Analysis of the UMAP DR data. The additional clustering algorithms also work similarly to the SVM.

**Figure 6.** Dimensionality reduction of the USPS dataset utilizing different DR algorithms. PaCMAP (**a**) exhibits good results with the original data. Applying our vector modification (**b**), although numerically inferior to the original, does not substantially change the clusters. For tSNE (**c**), the modified version may give poorer results due to the fragmentation of some of the clusters that can be seen in (**d**). trimap (**e**) is less effective than PaCMAP or UMAP at cluster separation, and our algorithm's output (**f**) appears visually akin to the original version (**e**). Finally, UMAP (**g**) yields results similar to PaCMAP, with well-concentrated clusters and only minor instances of collision or overlap. In this case, the data modification in (**h**) causes slight proximity changes in a couple of clusters.

**Figure 7.**The Coil20 dataset demonstrates enhanced clustering accuracy across all DR algorithms following our modification, except for the tSNE algorithm. This dataset contains 20 classes. Observe that the overall distribution of clusters improves after our modification (second and last column).

**Figure 8.** The documents dataset shows clear improvements with all the DR algorithms. Note that the modified versions generated projections with better separation between clusters.

**Figure 9.**This document set was not as clearly separated by any of the DR techniques. With our modification, the trimap demonstrated superior performance in achieving cluster separation. The other techniques also improved, but the clustering separations were not obvious.

**Table 1.** The datasets used in our validation study, with their dimensionality, number of samples, and number of classes.

Dataset | Dimensions | Samples | Classes |
---|---|---|---|
20NG | 99 | 18,844 | 20 |
Cifar10 | 1024 | 3250 | 10 |
Coil20 | 400 | 1440 | 20 |
Fashion-MNIST | 784 | 10,000 | 10 |
FlickMaterial10 | 1534 | 997 | 10 |
Har | 561 | 735 | 6 |
MNIST | 784 | 70,000 | 10 |
Sentiment | 200 | 2748 | 2 |
Spheres | 101 | 10,000 | 11 |
Svhn | 1024 | 732 | 9 |
USPS | 255 | 9298 | 10 |

**Table 2.**Impact of the vector manipulation algorithm on the PaCMAP DR method for clustering using SVM. The center column displays the accuracy of the clustering algorithm when no vector manipulation is applied to the data, while the rightmost column shows the accuracy when our manipulation approach is applied. Notably, in the majority of the cases (9 out of 11), the accuracy improves.

Dataset | Original + PaCMAP (SVM) | Improved + PaCMAP (SVM) |
---|---|---|
20NG | 0.5156 | 0.8605 |
Cifar10 | 0.2029 | 0.2252 |
Coil20 | 0.8278 | 0.8639 |
Fashion-MNIST | 0.7232 | 0.7424 |
FlickMaterial10 | 0.5433 | 0.6047 |
Har | 0.7285 | 0.7982 |
MNIST | 0.9733 | 0.9416 |
Sentiment | 0.6177 | 0.7898 |
Spheres | 0.6653 | 0.9755 |
Svhn | 0.1855 | 0.1955 |
USPS | 0.9476 | 0.9401 |

**Table 3.**tSNE accuracy without (center) and with (right) the algorithm. In this case, 7 out of the 11 models improve the result.

Dataset | Original + tSNE (SVM) | Improved + tSNE (SVM) |
---|---|---|
20NG | 0.4646 | 0.7854 |
Cifar10 | 0.2031 | 0.2306 |
Coil20 | 0.8319 | 0.8185 |
Fashion-MNIST | 0.7222 | 0.7373 |
FlickMaterial10 | 0.5633 | 0.6207 |
Har | 0.8588 | 0.8317 |
MNIST | 0.9666 | 0.9174 |
Sentiment | 0.5852 | 0.8252 |
Spheres | 0.7626 | 0.8933 |
Svhn | 0.1909 | 0.1982 |
USPS | 0.9528 | 0.9333 |

**Table 4.** Accuracy obtained using trimap dimensionality reduction. The center column shows the results without our method, while the rightmost uses our algorithm. trimap fails on the 20NG dataset. In this case, we obtain an improvement in 8 out of the remaining 10 models.

Dataset | Original + Trimap (SVM) | Improved + Trimap (SVM) |
---|---|---|
20NG | - | - |
Cifar10 | 0.1914 | 0.2273 |
Coil20 | 0.7977 | 0.8236 |
Fashion-MNIST | 0.7161 | 0.7251 |
FlickMaterial10 | 0.4780 | 0.6100 |
Har | 0.6688 | 0.7620 |
MNIST | 0.9636 | 0.8706 |
Sentiment | 0.4812 | 0.9522 |
Spheres | 0.7686 | 0.9757 |
Svhn | 0.1891 | 0.1964 |
USPS | 0.9387 | 0.9237 |

**Table 5.** Accuracy obtained using UMAP. The center column shows the results without our method, while the rightmost uses our algorithm. Here, we obtained an improvement in 8 out of the 11 models.

Dataset | Original + UMAP (SVM) | Improved + UMAP (SVM) |
---|---|---|
20NG | 0.4859 | 0.8076 |
Cifar10 | 0.2062 | 0.2195 |
Coil20 | 0.7894 | 0.8634 |
Fashion-MNIST | 0.7247 | 0.7343 |
FlickMaterial10 | 0.5833 | 0.6900 |
Har | 0.8235 | 0.8054 |
MNIST | 0.9650 | 0.9243 |
Sentiment | 0.5927 | 0.6327 |
Spheres | 0.5213 | 0.8933 |
Svhn | 0.1955 | 0.2045 |
USPS | 0.9520 | 0.9380 |

**Table 6.** A comparative analysis of the improvements achieved by the various Dimensionality Reduction (DR) techniques when combined with our modification method. The table displays the mean improvement obtained over five distinct clustering approaches. The models that demonstrate an improvement are highlighted in blue, while those that do not are highlighted in red. Notably, the models that exhibit no improvement are largely consistent across all DR techniques, and although they show a decline in accuracy, the magnitude of the reduction is negligible.

Dataset | Improved + PaCMAP | Improved + tSNE | Improved + Trimap | Improved + UMAP |
---|---|---|---|---|
20NG | 62.98% | 47.62% | - | 65.66% |
Cifar10 | 11.09% | 21.64% | 12.20% | 4.96% |
Coil20 | 2.61% | 3.48% | 2.80% | 6.38% |
Fashion-MNIST | 3.54% | 3.12% | 2.79% | 2.58% |
FlickMaterial10 | 12.92% | 13.25% | 26.85% | 16.80% |
Har | 9.50% | -0.27% | 8.08% | -1.50% |
MNIST | -3.25% | -2.51% | -8.38% | -2.65% |
Sentiment | 30.46% | 30.11% | 75.33% | 22.33% |
Spheres | 28.83% | 16.55% | 13.15% | 66.43% |
Svhn | 17.02% | 30.03% | 18.39% | 29.76% |
USPS | -1.61% | -0.89% | -1.91% | -1.32% |
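The mean improvements reported in Table 6 correspond to a simple relative change over the baseline accuracy. A minimal sketch follows; the numbers used below are the SVM-only accuracies from Table 2, so the result differs from the five-classifier mean in Table 6:

```python
def relative_improvement(original, improved):
    # Percentage change of the improved accuracy relative to the original.
    return (improved - original) / original * 100.0

# 20NG with PaCMAP and SVM (Table 2): 0.5156 -> 0.8605, roughly +66.9%.
# Table 6 reports 62.98% because it averages over five classifiers, not SVM alone.
gain_20ng = relative_improvement(0.5156, 0.8605)

# MNIST with PaCMAP and SVM (Table 2): 0.9733 -> 0.9416, a small decline.
loss_mnist = relative_improvement(0.9733, 0.9416)
```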

**Table 7.** The clusters of the synthetic document dataset and the number of articles in each.

Name of the Cluster | Number of Articles |
---|---|
Artificial Intelligence (AI) | 20 |
Astrophysics Galaxies (APG) | 20 |
Bicycle Sharing Systems (BSS) | 19 |
Computer Graphics—Ambient Occlusion (AO) | 24 |
Electrical Engineering 2022 (EE22) | 22 |
Electrical Engineering 2015 (EE15) | 24 |
Global Illumination (GI) | 25 |
High-Energy Astrophysics (HEAP) | 20 |
Information Theory (IT) | 23 |
Molecular Visualization in Virtual Reality (MVVR) | 20 |
Viewpoint Selection (VS) | 19 |
Visualization (Vis) | 20 |
Volume Rendering (VolRend) | 22 |

**Table 8.** The average similarity scores for the different clusters in the technical documents' dataset. Notice how the inner average similarity of each class is consistently higher than its average similarities against the other classes, except for the HEAP (High-Energy Astrophysical Phenomena) and APG (Astrophysics of Galaxies) classes. This aligns with the fact that both classes are closely related and belong to the same arXiv superclass. Additionally, as anticipated, the Volume Rendering and Ambient Occlusion classes exhibit a high degree of similarity.

 | AI | APG | BSS | AO | EE15 | EE22 | GI | HEAP | IT | MVVR | VS | Vis | VolRend |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AI | 0.144 | 0.102 | 0.130 | 0.129 | 0.091 | 0.104 | 0.116 | 0.096 | 0.129 | 0.121 | 0.125 | 0.114 | 0.123 |
APG | 0.102 | 0.290 | 0.213 | 0.195 | 0.077 | 0.093 | 0.106 | 0.249 | 0.189 | 0.200 | 0.192 | 0.099 | 0.190 |
BSS | 0.130 | 0.213 | 0.374 | 0.214 | 0.106 | 0.111 | 0.128 | 0.201 | 0.223 | 0.234 | 0.231 | 0.162 | 0.217 |
AO | 0.129 | 0.195 | 0.214 | 0.397 | 0.095 | 0.111 | 0.250 | 0.186 | 0.213 | 0.239 | 0.286 | 0.161 | 0.346 |
EE15 | 0.091 | 0.077 | 0.106 | 0.095 | 0.115 | 0.092 | 0.080 | 0.079 | 0.097 | 0.091 | 0.099 | 0.080 | 0.098 |
EE22 | 0.104 | 0.093 | 0.111 | 0.111 | 0.092 | 0.117 | 0.105 | 0.088 | 0.109 | 0.102 | 0.110 | 0.093 | 0.116 |
GI | 0.116 | 0.106 | 0.128 | 0.250 | 0.080 | 0.105 | 0.227 | 0.106 | 0.121 | 0.148 | 0.178 | 0.137 | 0.221 |
HEAP | 0.096 | 0.249 | 0.201 | 0.186 | 0.079 | 0.088 | 0.106 | 0.248 | 0.181 | 0.185 | 0.180 | 0.099 | 0.185 |
IT | 0.129 | 0.189 | 0.223 | 0.213 | 0.097 | 0.109 | 0.121 | 0.181 | 0.312 | 0.208 | 0.231 | 0.123 | 0.215 |
MVVR | 0.121 | 0.200 | 0.234 | 0.239 | 0.091 | 0.102 | 0.148 | 0.185 | 0.208 | 0.313 | 0.232 | 0.161 | 0.241 |
VS | 0.124 | 0.195 | 0.232 | 0.270 | 0.096 | 0.107 | 0.168 | 0.181 | 0.223 | 0.259 | 0.301 | 0.163 | 0.274 |
Vis | 0.115 | 0.099 | 0.163 | 0.160 | 0.080 | 0.093 | 0.136 | 0.099 | 0.121 | 0.158 | 0.163 | 0.194 | 0.174 |
VolRend | 0.123 | 0.190 | 0.217 | 0.346 | 0.098 | 0.116 | 0.221 | 0.185 | 0.215 | 0.241 | 0.291 | 0.176 | 0.347 |
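The scores in Table 8 can be reproduced schematically by averaging pairwise cosine similarities between document embeddings. In the sketch below, the embeddings are random stand-ins (an assumption; the paper uses doc2vec vectors), and excluding the self-similarity diagonal for intra-cluster scores is also an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in embeddings for two clusters of 4 documents each (assumption).
cluster_a = rng.normal(size=(4, 16))
cluster_b = rng.normal(size=(4, 16))

def mean_cosine(u, v, exclude_diagonal=False):
    # Average cosine similarity over all row pairs of u and v.
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = u @ v.T
    if exclude_diagonal:
        # For intra-cluster scores, drop the self-similarities (always 1).
        n = len(u)
        return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))
    return float(sims.mean())

# Intra-cluster score (diagonal of Table 8) vs. inter-cluster score (off-diagonal).
intra_a = mean_cosine(cluster_a, cluster_a, exclude_diagonal=True)
inter_ab = mean_cosine(cluster_a, cluster_b)
```

With real embeddings, repeating this for every pair of clusters fills in the full similarity matrix.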

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Rafieian, B.; Hermosilla, P.; Vázquez, P.-P.
Improving Dimensionality Reduction Projections for Data Visualization. *Appl. Sci.* **2023**, *13*, 9967.
https://doi.org/10.3390/app13179967
