Getting over High-Dimensionality: How Multidimensional Projection Methods Can Assist Data Science

The exploration and analysis of multidimensional data can be highly complex tasks, requiring sophisticated tools able to transform large amounts of data bearing multiple parameters into helpful information. Multidimensional projection techniques figure as powerful tools for transforming multidimensional data into visual information according to similarity features. Integrating this class of methods into a framework devoted to data science can contribute to generating more expressive means of visual analytics. Although Principal Component Analysis (PCA) is a well-known method in this context, it is not the only one, and, sometimes, its abilities and limitations are not adequately discussed or taken into consideration by users. Therefore, in-depth knowledge of multidimensional projection techniques, their strengths, and the possible distortions they can create is of significant importance for researchers developing knowledge-discovery systems. This research presents a comprehensive overview of current state-of-the-art multidimensional projection techniques and shows example code in the Python and R languages, all available on the internet. The survey segment discusses the different types of techniques applied to multidimensional projection tasks, from their background and application processes to their capabilities and limitations, opening up the internal processes of the methods and demystifying their concepts. We also illustrate two problems, from a genetic experiment (supervised) and from text mining (unsupervised), presenting solutions through the application of multidimensional projection. Finally, we present elements that underscore the competitiveness of multidimensional projection techniques for high-dimensional data visualization, commonly needed in data science solutions.


Introduction
Datasets generated nowadays have become increasingly larger and more structurally complex, produced by a diverse range of sources, such as relational and NoSQL databases, web pages, texts, images, recordings, videos, and others. The analytical effort to explore these datasets is a challenging problem, although they may hold valuable content for discovering and understanding phenomena in many knowledge domains. The more data collected, the more complex the analysis, to the point where graphical perception becomes infeasible due to the amount of information displayed on a screen [1]. Multidimensional data are those with m ≥ 4 attributes for each data object and may be represented as families of curves in the m-dimensional space [2]. Line and bar charts have long represented data in several applications. Nevertheless, line-based visual metaphors are not scalable for treating multiple variables at the same time, thus leading to problems of visual cluttering and occlusion [3].
An efficient and comprehensive representation of multidimensional data is beyond the capabilities of simple line and bar charts, due to the complexity of multidimensional structured data and the inability of these simple techniques to deal with large data volumes on a screen. Information Visualization (InfoVis) aims to develop and apply visual representations to model and understand attribute values and relationships and to extract information from data [4,5]. In this context, a class of InfoVis techniques has stood out for some time: multidimensional projections. These techniques generate injective mappings that transform multidimensional spaces into visual spaces (2D or 3D), preserving structures and similarities of the original spaces as much as possible, such as relationships between data instances or the presence of clusters [6][7][8]. This transformation is accomplished through mathematical operations that reduce the number of attributes to 2 or 3, allowing m-dimensional data objects to be represented in Cartesian space.
Although it is not a recent concept, research into the development of new multidimensional projection techniques has intensified in recent years. This rise of interest is due to the wide range of applications that can benefit from visual representations of large datasets with a large number of attributes [9], and the research into projection methods has promoted the creation of visual tools that reveal relevant patterns and trends hidden in multidimensional data. Examples of successful applications involving projection techniques can be found in the most diverse domains [6,[10][11][12]. The current literature on multidimensional projections has followed the recent developments, with many works reviewing fundamental concepts, proposing taxonomies, and evaluating performance [7,[13][14][15][16][17][18].
However, due to the profusion of new developments, we have identified some gaps in the literature. Moreover, the different terminologies adopted in previous papers and reviews can be somewhat confusing for researchers starting out in the subject, and even for those with some previous knowledge but (maybe) not specialized in InfoVis methodologies. For example, a given technique often cannot handle all data types, as each technique is tailored to efficiently handle one (or some) specific data type(s). In addition, all projection techniques require careful data pre-processing to properly explore their embedding abilities. Nevertheless, we believe that multidimensional projection techniques are highly informative tools that can add analytical capabilities for data scientists within the context of knowledge discovery and data mining tasks; thus, a comprehensive discussion aimed at simplifying the projection technique environment is needed.
In this sense, we aim to show the reader that multidimensional data analysis goes far beyond simple PCA. We do so with a sufficiently detailed discussion of the leading and more advanced types of techniques applied to multidimensional projection tasks, through a vocabulary that unifies the many different terms and definitions found in previous works. We also address taxonomies and essential concepts and considerations that must be taken into account by readers interested in applying projection techniques to multidimensional data, from the appropriate data treatment to the possible distortions and limitations of the obtained results.
The manuscript is structured as follows: Section 2 presents the research methodology adopted for the collection of papers; Section 3 addresses the ground theory needed to understand the multidimensional projection domain; Section 4 investigates multidimensional projection approaches and their applications; Section 5 shows discussions and applications; Section 6 examines the challenges and future research directions concerning multidimensional projections; finally, Section 7 presents our final comments.

Method of the Systematic Review
A content analysis of the published literature was conducted to better understand how multidimensional projection techniques have evolved over the last few years. Such an analysis systematically evaluates the available forms of communication, identifying and classifying critical contributions to the field, clarifying trends and practices, and indicating future and open research possibilities. Therefore, we designed the review so that the research goals could be achieved, and we established objective criteria to define the relevant literature and the appropriate reporting of the findings.
We queried the terms "multidimensional projection", "dimensionality reduction", and "multidimensional scaling", mainly restricted to (but not only) the 2005-2022 period and matched against publications' titles, abstracts, and keywords. Such a period was chosen to cover most of the published literature not reported by previous surveys while including seminal surveys. The following two criteria were employed over the search results to select publications for further revision:
• Papers published in peer-reviewed journals as articles and available online in English are the priority sources. In addition to these papers, we extended our scope to conference proceedings, arXiv e-Prints, theses, dissertations, and books;
• Papers must explicitly employ multidimensional data and multidimensional projection techniques. We then excluded articles that only list multidimensional projection in keywords, merely allude to multidimensional data, or apply multidimensional projection without further explanation of or reference to the specific methodology employed to process and present the information.
Papers that did not fulfill at least one of the selection criteria were removed from the review. The queries returned 183 papers; after a second filtering, only 131 were read in full. Duplicates were removed, the first selection criterion was applied, and the second criterion was applied after carefully reviewing the texts. Finally, we report 123 references here.

Ground Theory
This section provides an overview of definitions that should be considered when applying and developing multidimensional projection techniques. Our intent is not only to provide a comprehensive survey of multidimensional projection methods but also to discuss essential features and particularities in the context of data analysis and visualization.

Data Multidimensionality
Formally, a multidimensional dataset can be expressed as a set of n instances X = {x_1, x_2, . . . , x_n}, in which each instance is a vector of m items, x_i = (v_1, v_2, . . . , v_m). Each vector x_i can be seen as a data object and is represented by a subset of attributes of different data types, such as integers, reals, binaries, text, categorical values, images, and others. When x_i has more than three attributes, the dataset cannot be represented in simple visualizations such as those based on the Cartesian plane. In this case, the vectors produce families of curves in the m-dimensional space [19].
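For illustration, such a dataset can be held as an n × m matrix in which rows are data objects and columns are attributes. The following minimal Python sketch (using NumPy; the attribute values are made up for illustration) makes the notation concrete:

```python
import numpy as np

# A hypothetical multidimensional dataset: n = 5 instances, m = 4 attributes.
# Each row is one data object x_i; each column holds one attribute across objects.
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, 5.1, 1.9],
    [7.1, 3.0, 5.9, 2.1],
])

n, m = X.shape
print(n, m)   # 5 instances with m = 4 attributes: beyond direct Cartesian plotting
print(X[0])   # the first data object x_1, a vector of m attribute values
```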

Basic Terminology
The literature regarding multidimensional projection, or even multidimensional data, often utilizes many different terms to name similar concepts, which can lead to confusion. For example, it presents terms such as "multidimensional", "multivariate", or even "multivalued" for defining datasets in which each data object is composed of a set of attributes. However, these terms are not consistently used [20], and we will adopt the term "multidimensional" to denote datasets composed of multiple attributes, with or without dependence on each other.
Since the beginning of this text, we have referred to each vector x_i as a data object, but there are other terms, such as "instance", "observation", "data item" or just "item", "record", "array", or even "point". We name the vectors x_i from a dataset X "data objects" or "instances".
Another terminology issue is related to the x_i attributes, and one can find several terms denoting them, such as "variable", "dimension", "property", "characteristic", or "feature". Throughout the text, we use the terms "attribute" or "feature".
In addition, we will use the term "point" exclusively to name instances transformed into the visual space. We finish this subsection by noting that the matrix notation from mathematics is standard for denoting the data objects x_i as vectors. However, considering the context of data analysis, we argue that the terms adopted here are in line with precise data and mathematical terminology. This kind of explanation is warranted because the multidimensional projection domain is closely related, on mathematical grounds, to matrix theory.

Multidimensional Projection
A multidimensional dataset X, with n instances x_i ∈ R^m, can be represented as a matrix X_{n×m}, where each row is a vector x_i and the columns represent the vector attributes. According to Ware [21], high-dimensional datasets are those with m ≥ 12 attributes. Since the visual space is limited to three dimensions, as the dimensionality m increases, the complexity of representing and interpreting the data also increases.
In this context, techniques devoted to dimensionality reduction aim to present multidimensional data from high-dimensional spaces as points in dimensionality-reduced spaces. Thus, it is possible to use fewer dimensions to describe the original data, compacting large datasets and reducing the computational effort required in their processing. According to Pudil and Novovicova [22], there are two main categories of dimensionality reduction methods: feature selection and feature extraction.
Feature selection identifies the non-significant features related to the analysis task, omitting them to generate a subset containing only useful attributes that can efficiently describe the input data [23]. Based on the assumption that the original dataset may have redundant (or even irrelevant) attributes, the removal of non-significant features generates a subset of useful features that automatically results in a reduced data space [24]. In several cases, feature selection is sufficient to carry out data analysis. However, there are situations in which it is challenging to identify a significant subset of attributes, limiting the use of feature selection to the data pre-processing phase. As feature selection often does not result in a dataset of two or three dimensions, these methods are not part of this research (for a comprehensive overview of feature selection methods, see Chandrashekar and Sahin [23]).
On the other hand, feature extraction methods try to reduce data dimensionality by mapping the original dataset X from an m-dimensional space to a new dataset Y of p dimensions, with p < m. The mapping process uses a single function or a group of transformation functions to map the data, retaining the essential characteristics of the original dataset [25]. When p ∈ {1, 2, 3}, there is a particular class of multidimensional data mapping techniques called Multidimensional Projections (MDP) [7].
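As one concrete instance of a feature-extraction mapping, the sketch below (not taken from the surveyed papers; NumPy only) implements PCA through the singular value decomposition, mapping an m-dimensional dataset X to a p-dimensional dataset Y with p = 2:

```python
import numpy as np

def pca_project(X, p=2):
    """Map an (n x m) dataset to p dimensions via PCA, a linear
    feature-extraction method: center the data, then project it onto
    the directions of largest variance obtained from the SVD."""
    Xc = X - X.mean(axis=0)                     # center each attribute (column)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:p].T                        # n x p embedding Y, with p < m

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                  # n = 100 instances, m = 10 attributes
Y = pca_project(X, p=2)
print(Y.shape)                                  # (100, 2): plottable on the Cartesian plane
```

Since the singular values are returned in decreasing order, the first projected coordinate carries at least as much variance as the second.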
MDP maps multidimensional data to the Cartesian visual space, seeking to preserve in the mapped (projected) space some information from the original space, such as the distance relationships between the data instances [6]. Consequently, it is possible to create a graphical representation from the projected points and take advantage of human perceptual abilities to recognize structures or visual patterns based on similarities, particularly those related to groups of elements [9]. Projection techniques have been attracting the attention of the Information Visualization community for some time due to their broad applicability as an analytical tool. Because the mapped data become convenient for visualization, tasks that involve distance exploration or neighborhood relations can be streamlined [6].
Tejada et al. [26] mathematically defined the multidimensional projection concept as follows.

Definition 1. Let X be a set of n data objects in R^m, with m > 3, and δ : R^m × R^m → R a distance measure between the instances in R^m; let Y be a set of n points in R^p, with p ∈ {1, 2, 3}, and d : R^p × R^p → R a distance measure between points in R^p. A multidimensional projection technique can be described as a function f : X → Y that aims to make |δ(x_i, x_j) − d(f(x_i), f(x_j))| as close to zero as possible, for all pairs x_i, x_j ∈ X.

Similarity and dissimilarity represent somewhat vague concepts, as they do not yet have a well-established definition in the literature and may present different interpretations depending on the application domain [7]. In the context of MDP, and to prevent any inaccuracy, we adopt a geometric definition of (dis)similarity. Therefore, we establish that dissimilarity is a numerical measure used to indicate the "proximity" (distance) between two data objects. The higher the dissimilarity value, the more distant and distinct the objects are, according to some criterion or comparison function [27]. Inversely, a similarity measure indicates how similar two objects are, with high values of similarity denoting close and similar objects.
In this context, the concept of distance has a central role in multidimensional data comparison and projection. An interesting set of dissimilarity metrics is the family of Minkowski distances, calculated as

d_k(x_i, x_j) = ( Σ_{l=1}^{m} |v_il − v_jl|^k )^(1/k),

where k is a modification parameter. Changing k generates well-known functions, such as the Manhattan distance (k = 1), the Euclidean distance or L_2 norm (k = 2), and the Maximum distance or L_max norm (k = ∞). Several MDP techniques utilize the Euclidean distance as the dissimilarity metric, such as the ones proposed by Maaten and Hinton [28], Joia et al. [12], and McInnes et al. [29]. Nevertheless, depending on the nature of the data and the application domain under analysis, the Euclidean distance may not be the best way to express similarities or differences between data objects. A measure is considered a metric if it satisfies the metric space postulates [30], and some measures do not satisfy them. Therefore, it may be convenient in some cases to apply measures that do not meet the metric space properties. A deep analysis of the different types of existing measures is beyond the scope of this paper. The calculation of (dis)similarities can generate distortions because (i) there may be significant differences between the vectors' Euclidean norms, and (ii) features with values much larger or much smaller than others can dominate the projection results, forcing the method to ignore the importance of other features [31]. According to Joia et al. [12], distortions are inevitable in the multidimensional projection process, but it is possible to keep them as small as possible.
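The Minkowski family can be sketched in a few lines of Python; the vectors a and b below are arbitrary illustrative values:

```python
import numpy as np

def minkowski(a, b, k):
    """Minkowski distance of order k between two attribute vectors."""
    return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(minkowski(a, b, 1))     # Manhattan distance: 3 + 4 + 0 = 7.0
print(minkowski(a, b, 2))     # Euclidean distance: sqrt(9 + 16) = 5.0
print(np.max(np.abs(a - b)))  # Maximum (L_max) distance, the k -> infinity limit: 4.0
```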
To avoid these situations, a scaling process is applied to each vector (vector normalization) or to each data feature (feature normalization or standardization). Scaling is one of the most important pre-processing steps for projection techniques because it can unify the feature space by forcing the vectors to have unitary Euclidean norm or by forcing the features to satisfy a specific statistical property [32]. Figure 1 illustrates the application of normalization before checking the similarities between two data vectors. In this example, two sequences, A and B, have different graphical curves in terms of range and magnitude (horizontal and vertical shifting). However, the distance between A and B is proportional to the shifting, which can hide information about the sequences' similarity. In other words, if a proper normalization is not applied before the projection task, the projection method may overestimate the distances, which leads to distorted information regarding the correct amount of (dis)similarity among the data vectors (indicated by the shaded area between the curves in Figure 1, right side). After a vector normalization, one can determine the true similarities between the distinct curve profiles [33].
Following this example, each item x_i of a multidimensional dataset X can be normalized with

x̂_i = x_i / ||x_i||,

where ||·|| denotes the Euclidean norm. The importance and effects of normalization were extensively documented in papers such as Keogh and Kasetty [33]. Another example of transformation is the z-score standardization applied to each j-th column of a dataset X, making the column's mean equal to zero and its standard deviation equal to one. The process consists of subtracting from all values of a j-th column its mean x̄_j and dividing the result by the column's standard deviation σ_j, as in

x̂_ij = (x_ij − x̄_j) / σ_j.

Another common transformation is the min-max normalization, which converts the j-th column range [min x_j, max x_j] to a new range [min_new, max_new] with a simple linear transformation:

x̂_ij = (x_ij − min x_j) / (max x_j − min x_j) × (max_new − min_new) + min_new.

The new range is frequently [0, 1], and variations with other ranges, as well as the application of logarithm and square-root functions, can be performed [32]. This section presented elementary approaches focused on pre-processing multidimensional data with real numerical values. However, other transformations may be more appropriate depending on the data domain (categorical, textual, images, and others) and the problem to be solved. There are several other ways to transform the data and measure its dissimilarities according to the data context, and a detailed review on the subject can be found in Zezula et al. [30].
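The three transformations above can be summarized in a short Python sketch (NumPy only; the dataset X is an arbitrary illustration with one small-range and one large-range feature):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Vector normalization: force each row (instance) to unit Euclidean norm.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# z-score standardization: each column ends with mean 0 and standard deviation 1.
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization to the [0, 1] range, column by column.
X_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.linalg.norm(X_unit, axis=1))      # all rows now have norm 1.0
print(X_z.mean(axis=0), X_z.std(axis=0))   # columns have mean ~0 and std ~1
print(X_mm.min(axis=0), X_mm.max(axis=0))  # columns now span [0, 1]
```

Without such scaling, the second feature (hundreds) would dominate any Euclidean comparison against the first (units).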

Taxonomies Used in the Multidimensional Projection Area
The application of multidimensional projection techniques results in groups of points embedded in a p-dimensional space. We define a group, or a class, as a set of instances; however, there are some conceptual differences. A group can be formed by points perceptually close in the visual space, generated by a clustering algorithm that applies some dissimilarity measure in the data space. On the other hand, a class comprises objects that share some intrinsic meaning previously known or revealed by a classification algorithm that describes or distinguishes the objects belonging to the class [7]. Different approaches have been proposed to map multidimensional data into visual space, and the literature classifies them using terminologies that share common characteristics. Among the taxonomies found in the literature, the best known and most usually adopted are as follows.

(I) According to the transformation type

Linear techniques are based on transformations that create linear combinations of the data attributes, mapping them into a new space with a reduced dimensionality [9]. Formally, linear methods map the data using a function f : X → Y that satisfies the linearity conditions f(x + y) = f(x) + f(y) and f(αx) = αf(x) [25]. Although many MDPs apply linear resources in their processes, few approaches are truly linear, according to the strict mathematical definition [7].
Linear methods are often computationally efficient and relatively simple to develop. Once the transformation is calculated, the mapping of each instance in the Cartesian space is performed with matrix multiplication. Nevertheless, they cannot properly handle complex structures (such as data with nonlinear relationships between its attributes), achieving unsatisfactory results in many real scenarios [34]. This characteristic tends to generate representations with considerable distortions, which is a severe issue for visualization tasks.
Nonlinearity arises in many problems, such as the modeling of physical phenomena, and sometimes functions cannot satisfy linearity conditions, which naturally motivates the development of nonlinear projections. Nonlinear techniques aim to minimize an information loss function, usually supported by a mechanism able to relate the dissimilarities between instances in the high-dimensional space with the distances between the p-dimensional points [9].
Linear methods can present good results depending on the data distribution, but they rarely outperform nonlinear techniques in terms of the ability to deal with complex structures [16]. Another problem is the addition of new data subsets to the visual space. Linear techniques require complete reprocessing when a perturbation is introduced into the dataset because they are based on matrix operations [9]. On the other hand, nonlinear approaches require only a small number of additional iterations to incorporate new data into the projection.
(II) According to the projection nature

The concept of locality is used to distinguish two properties of a projection's nature: global modeling and local mapping. Global modeling comprises methods built on a single transformation [12] and methods that seek to preserve the original geometric relationships between all pairs of data instances in the transformed space [6]. That is, global modeling forces points close in the original space to remain close in the projected space, while distant points remain distant in the visual space.
Local mapping techniques are also based on neighborhood information, but they seek to preserve relationships considering only the surroundings of a small neighborhood in the m-dimensional space [6]. Points close to an instance in the high-dimensional space must be projected as close as possible to that instance in the visual space. Generally, local methods build a family of transformations to project the data; therefore, instances are not mapped by a single transformation but by a set of local mappings, in which each mapping projects a subset of the data, preserving the local structures of the dataset.
Local methods can rely on global mechanisms to perform multidimensional projections or combine global and local properties, generating hybrid approaches. For example, a small subset of data samples (often called control points) is defined and projected, relying on global relationships. From this initial mapping, the remaining instances are positioned based only on their neighborhood relationships concerning each control point [12], which maintains the essence of a local mapping.
In summary, global techniques are often accurate in preserving average distances between objects, in global terms, but fail to preserve local relationships, besides being computationally more expensive. Moreover, according to Fadel et al. [6], preserving distances as a whole tends to distort small neighborhoods. Thus, when the purpose is to preserve neighborhoods, local techniques are more suitable than global ones.

(III) Other classifications
The taxonomy presented above is the most widely adopted in the literature devoted to MDP. However, other classifications have been proposed to characterize the properties of different strategies. For example, projections can also be classified as interactive or non-interactive according to their ability to admit user interventions during the mapping process. In this sense, weighting mechanisms allow certain instances to have more influence on the mapping, while control points can be used as user-specified "anchors", an important step to guide the projection's behavior [7]. Unfortunately, choosing an appropriate set of control points to generate a mapping with as little distortion as possible is a complex problem [12].
There are other possible classifications, such as those related to method stability, in which the introduction of small perturbations in the data does not result in significant changes in the mapping. Another classification addresses multilevel projections, which contain resources that allow the organization of large databases through hierarchies, creating different levels of abstraction to reduce visual clutter [7]. Projection methods can also be classified according to their mathematical formulations, related to the decomposition procedures or the optimization methods adopted to perform the mapping. Maaten et al. [16] and Nonato and Aupetit [7] conducted extensive reviews of dimensionality reduction techniques, classifying them into further subdivisions.

Evaluation of Projected Spaces
Projection techniques can approximate similar instances and segregate distinct ones, grouping them in the visual space according to some (dis)similarity measure. Nevertheless, as aforementioned, MDPs can generate different levels of distortion in the data neighborhoods. Therefore, to assess the distortions and the quality of the views created, specialists map real data classes or groups to a color space, allowing them to evaluate the projection results visually. Unfortunately, this type of assessment can be subjective and complex due to data cluttering on the screen, so specialists also utilize quantitative measures to assess a projection's quality.
The best-known measure used to evaluate MDP quality is the stress function [35], which estimates the amount of information lost during the mapping process of data instances. The function is similar to a standard deviation calculation and is defined as

Stress = sqrt( Σ_{i<j} (δ_ij − d_ij)^2 / Σ_{i<j} δ_ij^2 ),

where δ_ij is the dissimilarity between instances i and j in the original space, and d_ij is the dissimilarity in the projected space. The stress function can quantify the degree of distortion generated through the mapping process with respect to the original space.
The results lie in the [0, 1] range, and the closer to zero, the better the distance preservation, with zero technically indicating a "perfect" projection. Although the stress function helps evaluate the generated mappings, it is common to find divergences between stress values and the visual interpretation of the projection [9].
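For illustration, the stress function can be sketched in Python as below. This is one common normalized variant (denominator Σ δ_ij^2); the "projection" that simply drops attributes is a deliberately crude example used only to produce some distortion:

```python
import numpy as np

def pairwise(X):
    """All pairwise Euclidean distances of an (n x m) dataset."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def stress(D_high, D_low):
    """Normalized stress between the pairwise-distance matrices of the
    original space (D_high) and the projected space (D_low);
    0 means distances were perfectly preserved."""
    iu = np.triu_indices_from(D_high, k=1)        # count each pair (i, j) once
    num = np.sum((D_high[iu] - D_low[iu]) ** 2)
    den = np.sum(D_high[iu] ** 2)
    return float(np.sqrt(num / den))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 8))
Y = X[:, :2]                                      # crude "projection": keep 2 of 8 attributes
print(stress(pairwise(X), pairwise(Y)))           # > 0: information was lost in the mapping
```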
Another measure is the neighborhood preservation index [10], which assesses the rate of neighbors that are close in the high-dimensional space and remain close in the projected space. Some projection techniques generate neighborhood distortions. For example, two instances close to each other (or similar) in the original space can be projected too far apart, which is the missing neighbor problem. On the other hand, two instances distant from each other (or dissimilar) in the original space can be projected into the same neighborhood, a problem named false neighbor (see Figure 1 in [7]). Note that both kinds of distortion can affect the neighborhood preservation index. False and missing neighbors impact projection-based analysis in different ways [36] and can generate a degree of uncertainty during the data analysis task.
The index is computed as

NP_k(i) = |kNN(x_i) ∩ kNN(y_i)| / k,

where kNN(·) is the set of the k nearest neighbors of some instance, x_i in the original space or y_i in the projected space [6].
To evaluate the projection of the entire dataset, one can take the mean of the NP_k(i) values, which lies in the [0, 1] range, with results closer to 1 indicating mappings with better preservation of neighborhoods with respect to the original space.
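A minimal Python sketch of the mean neighborhood preservation index, computed with brute-force nearest neighbors on an arbitrary random dataset (again using attribute-dropping as a crude stand-in for a projection):

```python
import numpy as np

def np_k(X, Y, k=5):
    """Mean neighborhood preservation: for each instance, the fraction of its
    k nearest neighbors in the original space X that remain among its
    k nearest neighbors in the projected space Y."""
    def knn(M):
        d = np.linalg.norm(M[:, None] - M[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)               # an instance is not its own neighbor
        return np.argsort(d, axis=1)[:, :k]
    nn_x, nn_y = knn(X), knn(Y)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_x, nn_y)]
    return float(np.mean(overlap))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 6))
print(np_k(X, X[:, :2], k=5))   # in [0, 1]; below 1 because neighborhoods were distorted
print(np_k(X, X, k=5))          # identical spaces preserve every neighborhood: 1.0
```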
The silhouette coefficient [37] is one of the most widely used qualitative measures to check cluster quality, based on calculating the cohesion and separation between groups of objects. It has also been widely applied to assess the quality of projected spaces [38,39]. The cohesion a_i of some instance x_i is calculated by averaging the distances between x_i and all other instances belonging to x_i's group. The inter-group separation b_i is defined by the minimum distance between x_i and the instances belonging to the other groups [12]. Therefore, the silhouette coefficient of some projected data instance x_i is computed as

s_i = (b_i − a_i) / max(a_i, b_i),

and the average Silh of all these values determines the average silhouette width of a dataset, with Silh ∈ [−1, 1]; the higher its value, the better the intra-group cohesion and inter-group separation. That is, projections that present a silhouette close to 1 will have instances of the same group close to each other, with the different groups distant from each other. The measures described here can be considered an initial step toward quantitatively assessing MDPs. However, other measures have been developed to evaluate the layouts generated by multidimensional projection techniques. A comprehensive analysis of the different qualitative measures applied in the context of projections can be found in Bertini et al. [40].
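The silhouette coefficient can be sketched as follows (a brute-force Python implementation, using the mean distance to each other group for b_i, on a synthetic two-group example):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: cohesion a_i is the mean distance of an
    instance to its own group; separation b_i is the smallest mean distance
    to any other group; s_i = (b_i - a_i) / max(a_i, b_i)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    s = []
    for i, li in enumerate(labels):
        same = (labels == li) & (np.arange(len(labels)) != i)
        a_i = D[i, same].mean()
        b_i = min(D[i, labels == lj].mean() for lj in set(labels) if lj != li)
        s.append((b_i - a_i) / max(a_i, b_i))
    return float(np.mean(s))

# Two well-separated groups of projected points should score close to 1.
rng = np.random.default_rng(3)
Y = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(silhouette(Y, labels))    # close to 1: cohesive, well-separated groups
```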

Influence of Graphical Perception and Visual Properties in Projection Analysis
Cleveland and McGill [41] defined visual perception as the ability of users to interpret visual encodings and understand the information presented graphically. It is challenging to extract the significance of complex data structures, since their instances bear multiple attributes that may vary individually and simultaneously. The well-known Gestalt theory states that the global perception of a scene's pattern cannot be explained by the sum of its parts [42]. In other words, a single element can present a specific context, but when associated with other elements, the set can represent different characteristics and contexts.
All in all, users need tools, such as multidimensional projections, that explore complex data features to reduce the cognitive load involved in analyzing massive amounts of data. However, how do we present data to users, exploring their cognitive abilities, so as to make the information presented meaningful? In graphical terms, Gestalt theory formulated some interesting principles for making the scenario more understandable, which we summarize as follows:
• Proximity principle: visual elements close together in the visual space are preattentively (intrinsically and without conscious control) processed as a group that shares similar features, even if the instances are not explicitly grouped;
• Similarity principle: elements represented by visual structures that share similar features (size, color, orientation, symmetry, parallelism) are perceptually grouped;
• "Common fate" principle: visual elements that undergo similar visual transformations tend to be mentally grouped. The dynamism of movement helps the viewer perceive which objects are related to the same action;
• Closure principle: elements delimited in areas with clear contours tend to be visually grouped, even if the contours are not entirely continuous.
Color, position, size, user interaction, and other graphical properties are essential for a suitable visualization layout. Moreover, a comprehensive MDP tool must present accurate layouts that follow graphical properties that explore the human's cognitive abilities, leading to the visual transformation of raw data into information. The graphic metaphor commonly applied to display data resulting from a projection is the scatter plot [43].
A classical 2D scatter plot maps values of a data object using symbols on the Cartesian plane. For example, let S ∈ R 2 be a dataset whose s instances record the mass of an element measured over the interval t. Then, for each i point, the i-th symbol position relative to the abscissa axis is horizontally determined through the t i value in which s i was measured. At the same time, the position relative to the ordinate axis of that i-th symbol is vertically determined by the s i value, measured in t i . The advantage of a scatter plot is the consumption of only a few pixels to represent a single data object, standing out as a space-efficient chart. In addition, they are well-established metaphors commonly used in the scientific and business context [44].
According to McLachlan et al. [45], for users to learn a visualization tool quickly, it must build on their experience and intuition; therefore, graphic representations should be familiar to the user. In this sense, scatter plots are straightforward; however, they have limitations and can suffer from poor visual scalability. Occlusion occurs when the number of points becomes larger than the screen resolution [46]; overlapping occurs when dots lie very close to each other, obscuring the individual legibility of the symbols [44]. An important issue regarding scatter plots in a multidimensional projection context is that there are no coordinate axes. The reason is that variables belonging to different scalar types are processed to extract similarity information between instances and, in the visual space, the transformed data points express information only through their relative positions. According to Nonato and Aupetit [7], references such as orthogonal axes perturb the analysis by raising questions about the meaning of the axes, inducing users to read the absolute positions of points in the visual space.
In the following sections, we will discuss approaches and techniques that apply the points reviewed in this section.

Multidimensional Projection Approaches and Domains
This section addresses the main dimensionality reduction techniques found in the literature, emphasizing the review of well-established multidimensional projection approaches. This methodological choice is justified due to the recent interest in applying this approach in the information visualization context, that is, when the p-dimensional space is 2D or 3D. Here, each method will be described in a subsection containing discussions about its characteristics and properties and some variations and improvements. The following subsections were organized according to the development chronology of the methods.

Principal Component Analysis
Principal Component Analysis (PCA) was introduced in the early 1900s. It was probably the first dimensionality reduction technique and remains one of the best known and most used [47]. PCA maps data from a high-dimensional space to a low-dimensional one by finding orthogonal linear combinations that best represent the original data variability. These combinations are named principal components. The process is performed so that the first component is related to the direction of highest data variance; the second component is related to the second highest variance, orthogonal to the previous component, and so on.
One can obtain the components a_1, a_2, . . . , a_p by calculating the covariance of the m attributes of a dataset X (Section 3.3). Being x_i and x_j two attributes of X, and x̄_i and x̄_j their means, the covariance between them is defined as

cov(x_i, x_j) = (1/(n − 1)) Σ_{k=1}^{n} (x_{ki} − x̄_i)(x_{kj} − x̄_j).

With this formulation, it is possible to build a square matrix C_{m×m} with each position representing the covariance between a pair of data attributes. Next, spectral decomposition can be applied to find eigenvectors and eigenvalues by writing C as

C = U Λ U^T,

such that Λ = diag(λ_1, λ_2, . . . , λ_m), with λ_1 ≥ λ_2 ≥ · · · ≥ λ_m, is a diagonal matrix with the eigenvalues of C and U_{m×m} is the orthogonal matrix with its eigenvectors. Finally, the principal components that generate the p-dimensional space are obtained by applying the eigenvectors to the data matrix:

a_i = X u_i,

with each u_i being a column of the U matrix. It is possible to obtain as many principal components as original data dimensions; however, the number p of dimensions should be sufficiently large to represent the initial space without significant information loss. It is important to observe that the first components generated by PCA absorb the most significant part of the data variance. This characteristic makes PCA a suitable approach for identifying data trends and patterns [48], besides weeding out part of the data noise and capturing the data variability in a few dimensions.
Due to numerical and practical issues, PCA is usually implemented through Singular Value Decomposition (SVD), which can be interpreted similarly to the eigendecomposition mentioned above. The process centers the columns of X, subtracting each value by the respective column mean. SVD decomposes the centered matrix as

X = U S V^T,

with the columns of U being the left singular vectors, the columns of V the right singular vectors, and S a diagonal matrix with the singular values. Then, the p-dimensional space is created with A = US [49]. In this case, the covariance is not calculated, but the centering of the data forces the singular vectors and values to carry a meaning similar to the eigendecomposition results. As a projection technique, PCA preserves global structures better than local ones, owing to the preservation of the large variances that appear along orthogonal principal axes [7,15]. Nevertheless, it is not capable of properly representing datasets formed by nonlinear relations among attributes [31]. Besides, PCA is unsuitable for representing data with several groups that have distinct variances since, for visualization purposes, only the first 2 or 3 components are used [31].
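The SVD route to PCA can be sketched in a few lines of NumPy; the following is an illustrative implementation (the function name and the toy dataset are ours), computing A = US restricted to the first p columns:

```python
import numpy as np

def pca_svd(X, p=2):
    """Project X (n x m) onto its first p principal components via SVD."""
    Xc = X - X.mean(axis=0)              # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :p] * S[:p]              # A = U S, restricted to p columns

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] * 2.0                  # a linearly correlated attribute
A = pca_svd(X, p=2)
print(A.shape)  # (100, 2)
```

Because the columns of U are orthonormal, the resulting component scores are uncorrelated, which is the property the text attributes to the principal axes.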

Multidimensional Scaling
One of the first global multidimensional projection techniques is Multidimensional Scaling (MDS), a classic algorithm that originated in psychophysics and became famous after Torgerson's paper [50]. Nowadays, MDS comprises a family of nonlinear methods that seek to define an injective mapping of data objects from a multidimensional space to a lower-dimensional one, maintaining the data distance relations [9]. The way distances are preserved throughout the transformation process determines the differences among MDS-based approaches.
Classical Scaling is the best-known MDS method. This technique builds a transformed space by applying spectral decomposition of a symmetric matrix generated from distances of multidimensional objects [6,27]. More specifically, it seeks to satisfy δ(x i , x j ) = d(y i , y j ), in Euclidean space, with δ(x i , x j ) being the dissimilarity between objects x i and x j , and d(y i , y j ) the dissimilarity between projected objects. The step of spectral decomposition is similar to the step demonstrated in PCA (Section 4.1).
The outcomes attained by Classical Scaling are considered satisfactory and precise concerning global distance preservation. However, its computational complexity of O(n^3) limits the approach's applicability; applying Classical Scaling to large datasets (millions of items or more) is not viable. To mitigate this problem, alternatives such as Landmark MDS (LMDS) [51] and Pivot MDS [52] were developed.
Instead of decomposing the complete dissimilarity matrix, LMDS selects an initial subset from the original dataset containing s reference instances named landmarks. These reference instances can be selected randomly or by an algorithm that maximizes the distances among them. The approach performs Classical Scaling in the landmarks subset, projecting these reference points on the R p . Finally, the remaining data instances are mapped on the new space through distance-based triangulation.
LMDS preserves MDS characteristics and is more efficient, with a complexity of roughly O(s^3 + sn). Nonetheless, at least p + 1 landmarks should be selected to generate a good projection.
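The core of Classical Scaling — double-centering the squared distance matrix and taking the top eigenpairs — can be sketched as follows. This is an illustrative NumPy implementation (names are ours), checked on four points in the plane whose distances it must reproduce up to rotation/reflection:

```python
import numpy as np

def classical_mds(D, p=2):
    """Classical scaling: embed points from a distance matrix D (n x n)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:p]          # keep the p largest
    L = np.sqrt(np.maximum(evals[idx], 0))
    return evecs[:, idx] * L

# distances among 4 points on a plane are reproduced exactly
P = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
Y = classical_mds(D, p=2)
D2 = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
print(np.allclose(D, D2, atol=1e-8))  # True
```

The O(n^3) eigendecomposition visible here is exactly the bottleneck that LMDS and Pivot MDS avoid by decomposing only a landmark submatrix.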

Sammon's Mapping and Related Projections
In some cases, the obligation to maintain dissimilarities during the transformation process can be restrictive, generating poor mappings. In this scenario, techniques based on nonlinear optimization emerged. This class of global methods is derived from MDS theory and maps the original space to the visual space by minimizing a loss function g in such a way that d(y_i, y_j) ≈ δ(x_i, x_j). Gradient descent is a well-known algorithm for minimization problems and is used in the following projection methods. Gradient descent walks in the opposite sense of the function gradient, −∇f(x(k)), with steps of size α, varying at each iteration and defined by some algorithm or heuristic (for example, a line search satisfying the Wolfe conditions or the Barzilai-Borwein method). Each update step is defined as

x(k + 1) = x(k) − α ∇f(x(k)),

and the algorithm finishes when the gradient stops changing or a maximum number of iterations k > 0 is reached.
Kruskal [35] was the first to model the projection problem as an optimization problem that seeks to generate a layout minimizing the quadratic difference between the original data dissimilarities and the projected data distances. This quadratic difference is well known as the stress function (Equation (5)) and will be identified here by S_T. Kruskal [35] then used gradient descent to find the values that minimize the stress function. With x(k) being a vector with all the n instances of X after the k-th iteration, the minimization process updates the mapping with the following equation:

x(k + 1) = x(k) − α ∇S_T(x(k)) / ‖∇S_T(x(k))‖.

In this process, the gradient is normalized so as not to change the magnitude of x(k).
Kruskal's method is related to Sammon's Mapping [53], one of the best-known dimensionality reduction techniques in the InfoVis area. The latter introduces the loss function normalization in a different way, which can be seen by comparing Equation (5) (used by Kruskal) with Equation (13) (applied by Sammon). Sammon's [53] mapping also uses gradient descent to minimize the information lost in the transformation process while preserving the global data dissimilarities of the original space, finding a local minimum solution [9,54].
Note that Equation (13) is weighted by the sum of δ(x i , x j ) −1 , which gives more importance to small dissimilarities. This characteristic enables Sammon's Mapping to project data with high dimensionality; however, it can generate distortions in multidimensional data with nonlinear relations, an issue that also appears in Kruskal's method [9].
Sammon [53] defined the update steps by the following equation:

y(k + 1) = y(k) − β Δ(k),

such that the step size β lies in the interval 0.3 ≤ β ≤ 0.4 and Δ(k) is defined by the ratio between the first-order partial derivatives of the loss function and the absolute value of its second-order partial derivatives. Another issue related to Sammon's and Kruskal's methods is the quadratic computational complexity (O(kn^2) operations; see Table 1). To mitigate it, Pekalska et al. [54] developed a strategy that, in a first step, maps a subset with s samples to the visual space using Sammon's Mapping. Then, the remaining samples are positioned by applying some interpolation method, such as triangulation, a neural network, or a linear transformation. With this change, the approach attains good precision and improves efficiency, with a complexity of O(s^3 + sn) [12]. On the other hand, the new approach needs a "high quantity" of representative samples to perform the initial projection, with the authors suggesting half of the dataset, n/2.

Table 1. Multidimensional projection methods. The column "Complexity" presents the computational load related to the number of instances (n), dimensions (m), iterations (k), samples (s), and graph edges (E).

(Table 1 columns: Technique; Transformation Nature; Complexity.)
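The stress-minimization idea behind Kruskal's and Sammon's methods can be sketched with plain gradient descent on the raw (unnormalized) stress — a simplification, since Kruskal normalizes the gradient and Sammon weights each term by δ(x_i, x_j)^{-1}. All names here are illustrative:

```python
import numpy as np

def raw_stress(Y, delta):
    """Sum of squared differences between original and projected distances."""
    d = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    return np.sum((delta - d) ** 2) / 2  # each pair counted once

def stress_gd(delta, p=2, alpha=0.002, iters=500, seed=0):
    """Minimize the raw stress by plain gradient descent with a fixed step."""
    rng = np.random.default_rng(seed)
    n = delta.shape[0]
    Y = rng.normal(size=(n, p))                   # random initial layout
    for _ in range(iters):
        diff = Y[:, None] - Y[None, :]            # y_i - y_j, shape (n, n, p)
        d = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(d, 1.0)                  # avoid division by zero
        coef = (d - delta) / d
        np.fill_diagonal(coef, 0.0)
        grad = 2.0 * (coef[:, :, None] * diff).sum(axis=1)
        Y -= alpha * grad
    return Y

X = np.random.default_rng(1).normal(size=(20, 5))
delta = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y0 = np.random.default_rng(0).normal(size=(20, 2))  # same as the initial layout
Y = stress_gd(delta)
print(raw_stress(Y, delta) < raw_stress(Y0, delta))  # True: stress decreased
```

The O(kn^2) cost mentioned in the text is visible here: every iteration touches all n^2 pairwise distances.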

Isometric Feature Mapping
Another important variation of Classical Scaling is Isometric Feature Mapping (Isomap) [55], a global nonlinear projection technique. Isomap transforms the distance relationships among multidimensional data instances before projecting them into the visual space [31]. Isomap aims to capture the data's topological (manifold) structure by computing geodesic distances: it turns the dataset into a graph structure and considers distances based on the shortest paths between vertices. Isomap is probably the first projection technique to resort to topological data analysis mechanisms to achieve dimensionality reduction [56].
Isomap builds an undirected weighted graph G = (V, E), such that v_i ∈ V represents a data instance x_i and w_ij = δ(x_i, x_j) is the weight function. The graph edges can be created with the k-nearest neighbor (k-NN) approach, which considers a data distance to define each e_i ∈ E; therefore, the k closest vertices to a given vertex v are connected. With this process, the δ function information is replaced by the shortest path between points, calculated by some path algorithm, such as Dijkstra's [57]. Moreover, the graph's adjacency matrix D is considered as the "distance" matrix of the data, with cells related to data with no edge filled with the ∞ value.
Finally, the visual space is created by applying some projection technique, such as MDS, to the D matrix. Unfortunately, the distance calculation and the spectral decomposition affect computational performance, which attains complexity O(n^3). The topological stability of Isomap depends on the correct choice of k [58]. Moreover, an isometric mapping of a high-dimensional space into a low-dimensional one is possible only under particular conditions, resulting in mistakes and distortions otherwise [7].
Silva and Tenenbaum [59] created a more computationally efficient version of Isomap, named L-Isomap. First, it selects s landmarks to generate the graph representing relationships among the landmarks and the other data instances. After that, the LMDS technique is applied to project the landmarks and approximate the positions of the remaining points. It reduces the Isomap complexity to O(sn log n).
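The graph-plus-scaling pipeline can be sketched end to end. The following illustrative NumPy version (names are ours) uses Floyd-Warshall instead of Dijkstra for brevity and finishes with Classical Scaling on the geodesic distances; on a circle embedded in 3D, the geodesics follow the curve rather than cutting through it:

```python
import numpy as np

def isomap(X, k=5, p=2):
    """Illustrative Isomap: k-NN graph, geodesic distances, classical scaling."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    G = np.full((n, n), np.inf)                # adjacency "distance" matrix
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:    # connect the k nearest neighbors
            G[i, j] = G[j, i] = D[i, j]
    for m in range(n):                         # Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    # classical MDS on the geodesic distance matrix
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:p]
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

# a circle in 3D; with k=4 the neighborhood graph stays connected
t = np.linspace(0, 2 * np.pi, 60, endpoint=False)
X = np.c_[np.cos(t), np.sin(t), np.zeros_like(t)]
Y = isomap(X, k=4, p=2)
print(Y.shape)  # (60, 2)
```

Note the dependence on k mentioned in the text: too small a k disconnects the graph (infinite geodesics), too large a k short-circuits the manifold.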

FastMap
Faloutsos and Lin [60] created the FastMap projection, a global nonlinear technique that reaches high efficiency but can generate low accuracy, depending on the structure of the projected data. FastMap maps objects from an R^m space into an R^p, with p < m, where each data item is projected onto p orthogonal hyperplanes whose axes represent the projected data coordinates. The approach's core is to project items onto a particular line segment contained in a hyperplane.
The process starts with a set of instances O and the distance matrix D generated with the Euclidean distance function δ over the instances. As depicted in Figure 2, two instances O_a and O_b, the furthest apart, are selected from the high-dimensional space. This selection requires O(n^2) computations; however, Faloutsos and Lin [60] also proposed a linear heuristic to perform it. Firstly, it selects an arbitrary instance O_r. Next, it chooses the instance farthest from O_r to be the O_a pivot. Finally, it selects the instance farthest from O_a to be the O_b pivot. The line O_a O_b lies in a hyperplane Y, and it will represent the first coordinate of the new projected space R^p.
The next step involves calculating the distances in the hyperplane Y, generating a new D matrix. As described by Equation (17), the new distance is related to the previous one and to the difference between the coordinates of the new point-objects. Lastly, the other dimensions can be generated by applying the same process: according to Figure 2, starting now with the new distance matrix, finding new pivot objects in a hyperplane H, calculating the position of all objects on the pivot line, and creating another distance matrix. This process guarantees that the current line O_a O_b, and consequently the hyperplane H that contains it, is orthogonal to the previous one. The process continues until all p dimensions are calculated.
FastMap is based on classical Euclidean geometry, requiring only data dissimilarities and employing just two representative samples per dimension to guarantee the orthogonality of the dimensions [6]. The algorithm has linear complexity, being faster than the other ones. Unfortunately, it presents high levels of information loss and cannot represent nonlinear datasets properly.
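The pivot-and-project loop can be sketched as follows. This illustrative NumPy version (names are ours) uses the cosine-law coordinate x_i = (d(a,i)^2 + d(a,b)^2 − d(b,i)^2) / (2 d(a,b)) and the single-pass pivot heuristic; the original paper refines the pivots over a few passes:

```python
import numpy as np

def fastmap(X, p=2):
    """Illustrative FastMap: pick two distant pivots, project on their line, repeat."""
    n = len(X)
    D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)    # squared distances, updated in place
    coords = np.zeros((n, p))
    for dim in range(p):
        # linear pivot heuristic: farthest from an arbitrary point, then farthest from that
        a = int(np.argmax(D2[0]))
        b = int(np.argmax(D2[a]))
        dab2 = D2[a, b]
        if dab2 == 0:
            break                                     # all remaining points coincide
        # cosine-law projection of every object onto the a-b line
        x = (D2[a] + dab2 - D2[b]) / (2 * np.sqrt(dab2))
        coords[:, dim] = x
        # update squared distances on the hyperplane orthogonal to the line
        D2 = D2 - (x[:, None] - x[None, :]) ** 2
        D2 = np.maximum(D2, 0)                        # guard against rounding
    return coords

X = np.random.default_rng(2).normal(size=(50, 6))
Y = fastmap(X, p=2)
print(Y.shape)  # (50, 2)
```

Only the rows of D2 touched by the pivots are strictly needed, which is what keeps the method linear in n.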

Force-Based Placement
Force-based techniques are simple approaches applied to data projection that consider data instances connected by "virtual springs" and attempt to minimize the sum of the forces acting upon each data instance. Low-dimensional positions are randomly initialized, and the spring forces adjust the positions towards equilibrium [9]. The forces are calculated from δ(x_i, x_j) − d(y_i, y_j), with δ and d dissimilarity functions in the original and projected spaces, respectively. Therefore, points that are distant in the original space repel each other in the projected space, and points that are close in the original space attract each other in the projected space.
Unfortunately, the mapping accuracy is impaired by the computational complexity of O(n^3) (n iterations of O(n^2) force calculations), making its application to large datasets unfeasible. Chalmers [61] handled this problem with a limited-size neighborhood for each data instance. After randomly initializing the projected space, considering an object x_i, the process creates a subset V_i containing the k nearest neighbors of x_i and computes δ_max_i, the maximum dissimilarity between x_i and the members of V_i. In an iterative process, a subset S_i is filled until it reaches k instances. Each iteration randomly selects a new object x_j that does not belong to V_i. If δ(x_i, x_j) < δ_max_i, then x_j is inserted into V_i (while the instance corresponding to δ_max_i is removed from V_i) and δ_max_i is updated; otherwise, x_j is inserted into S_i. When S_i is filled with k instances, Equation (18) returns the resultant force to be applied to y_i, which is linearly proportional to δ(x_i, x_j) − d(y_i, y_j).
Therefore, instead of calculating n(n − 1) forces in each iteration, Chalmers [61] seeks to preserve original distances in projected space, considering a reduced subset of instances. Despite the computational complexity improvement and preservation of small neighborhoods with linear iterations [6], O(n 2 ) operations to balance the layout still hinder its application in real applications [9].
Morrison et al. [62] also tried to reduce the computational complexity with a hybrid version of the technique based on a sampling strategy. The process generates a random sample with √n instances and projects them with Chalmers' model. The n − √n remaining data instances are interpolated with a modified version of the Brodbeck and Girardin [63] strategy. These modifications reduced the complexity to O(n√n), but the approach can be inaccurate when applied to complex data structures and can get stuck in a local minimum [64].
Another approach, named Force Scheme, was proposed by Tejada et al. [26]. This projection moves all points in a neighborhood N_i towards a point y_i instead of moving y_i towards its neighbors. In other words, each y_i attracts or repels the other y_j points during an iteration. Positions are updated with a fraction of the force, considering the residual distance between the original and projected spaces, defined as

Δ = (δ(x_i, x_j) − δ_min) / (δ_max − δ_min) − d(y_i, y_j),

where δ_max and δ_min are the largest and smallest dissimilarities in the original space, respectively. The formulation respects the original global dissimilarities, but this characteristic allows the large dissimilarities to dominate the projection process and distort small local neighborhoods. Consequently, in datasets that contain instances with high dissimilarities, the process tends to concentrate points on the outermost edges of the projection. Force Scheme can generate accurate layouts and reach stabilization with few iterations compared with other force projections. Unfortunately, the number of operations in each iteration is O(n^2), a high computational complexity.

Ingram et al. [64] developed Glimmer, a technique that seeks to reduce the processing time of the force calculations using the processing power of GPUs (Graphics Processing Units). The algorithm projects the data with a multi-level variation of Chalmers' model, starting with one data instance and adding new instances until the whole dataset is mapped to the projected space. The method presents good data cohesion and separation performance, reaching small stress values. Unfortunately, even with the efforts to reduce computational complexity, force-directed projections remain unfeasible for huge datasets [12].
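A Force Scheme iteration can be sketched as below. This illustrative NumPy version (names and the force fraction are ours; the original paper gives its own normalization details) moves every y_j toward or away from each y_i by a fraction of the residual between the normalized original dissimilarity and the current projected distance:

```python
import numpy as np

def force_scheme(X, p=2, iters=50, frac=0.125, seed=0):
    """Illustrative Force Scheme: each point attracts/repels all the others."""
    rng = np.random.default_rng(seed)
    n = len(X)
    delta = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    dmin, dmax = delta[delta > 0].min(), delta.max()
    delta_n = (delta - dmin) / (dmax - dmin)       # normalized dissimilarities
    Y = rng.normal(size=(n, p))                    # random initial layout
    for _ in range(iters):
        for i in range(n):                         # O(n^2) force updates per iteration
            v = Y - Y[i]                           # directions from y_i to each y_j
            d = np.linalg.norm(v, axis=1)
            d[i] = 1.0                             # avoid division by zero
            residual = delta_n[i] - d              # positive -> push away, negative -> pull
            move = frac * residual[:, None] * (v / d[:, None])
            move[i] = 0.0
            Y += move
    return Y

X = np.random.default_rng(1).normal(size=(30, 5))
Y = force_scheme(X)
print(Y.shape)  # (30, 2)
```

The doubly nested update makes the O(n^2)-per-iteration cost discussed above explicit.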

Stochastic Neighbor Embedding
Hinton and Roweis [65] developed Stochastic Neighbor Embedding (SNE), a nonlinear statistical technique based on the minimization of the Kullback-Leibler (KL) divergence [66] that uses probability distributions to represent the dissimilarities in the original and projected spaces. SNE models two Gaussian distributions, one, P, over the original m-dimensional space and another, Q, over the projected p-dimensional space (initialized with random positions). In this scenario, pairs of instances with small Euclidean distances are associated with high probability values, and pairs with large distances have low probabilities.
Equation (20) defines the conditional probability p_{j|i} between each pair of instances x_i and x_j, with δ the Euclidean distance and σ_i related to the size of the x_i neighborhood, being smaller in dense regions and bigger in sparse ones [67]. The best value of σ_i is found by a binary search such that the Shannon entropy of the p probabilities reaches log_2(perplexity), with the perplexity frequently defined by the user in the [5, 50] range [65]. Meanwhile, Equation (21) defines the probabilities in the p-dimensional space, with σ_i = 1/√2 without loss of generality. In both spaces, self-probabilities are null, i.e., p_{i|i} = q_{i|i} = 0.
The KL divergence between the probability distributions P_i and Q_i is minimized by applying the gradient descent algorithm [68] to discover the best values of the p-dimensional coordinates [67], as in Equation (22):

arg min_Y Σ_i KL(P_i ‖ Q_i) = arg min_Y Σ_i Σ_j p_{j|i} log(p_{j|i} / q_{j|i}).

This optimization process can be interpreted as a system of attraction/repulsion forces between y_i and all y_j points because it approximates similar points and segregates the dissimilar ones.
Unfortunately, the KL divergence generates space distortions due to its asymmetric characteristics and is sensitive to the crowding problem [28]. These issues were mitigated by Maaten and Hinton [28] with the introduction of t-Distributed Stochastic Neighbor Embedding (t-SNE), which applies to the original space a symmetrized version of the conditional probabilities,

p_ij = (p_{j|i} + p_{i|j}) / (2n),

with p_{ii} = 0. In the projected space, the authors replaced the Gaussian distribution with the Student's t-distribution because the latter is simpler and faster to compute while maintaining similarities with the former [28]. In addition, in the projected space, q_{ij} = q_{ji} and q_{ii} = 0. Both SNE and t-SNE can represent global structures and reveal local data characteristics, preserving some level of data neighborhood. The t-SNE and other similar techniques are robust against norm concentration, a typical characteristic of the curse of dimensionality [7], because of their shift-invariant similarities [69]. Although attaining good results compared to other methods, in t-SNE projections the observed distance between clusters is unreliable and sensitive to parametrization (the perplexity parameter) [70], which can generate a misleading impression of cluster distance and shape resulting from the local nature of the optimization and how hyperparameters are set up [8]. Moreover, t-SNE presents a computational complexity of O(n^2), which impairs practical applications, but Maaten [67] developed an accelerated version that reaches O(n log n) through approximations.
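The symmetrized input probabilities used by t-SNE can be computed in a few lines. The following illustrative NumPy sketch (names are ours) uses a single fixed σ for every point — a simplification of the per-point perplexity search described above:

```python
import numpy as np

def symmetric_probabilities(X, sigma=1.0):
    """Symmetrized t-SNE input probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n),
    with one fixed sigma (real t-SNE tunes sigma_i per point via perplexity)."""
    n = len(X)
    D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    P = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                     # self-probabilities are null
    P = P / P.sum(axis=1, keepdims=True)         # conditional p_{j|i}, rows sum to 1
    return (P + P.T) / (2 * n)                   # symmetrize

X = np.random.default_rng(0).normal(size=(10, 4))
P = symmetric_probabilities(X)
print(np.isclose(P.sum(), 1.0))  # True: a joint distribution over all pairs
```

The resulting matrix is symmetric by construction, which is precisely the asymmetry fix that distinguishes t-SNE's input distribution from SNE's conditionals.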

Least Square Projection
The Least Square Projection (LSP) [9] is a nonlinear global projection that preserves neighbor relationships at a high level and allows user interactions. It consists of two steps: (i) a small subset of data samples is carefully chosen and projected with some MDS approach; (ii) the positions of the remaining data instances are calculated with a linear system that considers the previous projected points and a Laplacian matrix operator that maps the neighbor relations of each instance to place it close to its nearest points.
The initial subset of samples, named here as control points, leads the geometry of the final map, and its position in the projected space significantly impacts the LSP result; thus, they must accurately represent the different groupings of instances from the original space. To select them, a dataset X with n instances is split into s = √ n clusters using the k-medoids algorithm [13], and the clusters' medoids are then considered as the control points. According to Paulovich et al. [9], the definition of s with the squared root of dataset size can accurately represent the whole dataset, keeping LSP complexity feasible. An MDS technique or a user interaction-based method is applied to lay the control points out, allowing the user to manipulate the mapping process.
In the next step, LSP maps the remaining instances x_i with a strategy that generates a list V_i with the k neighbors of x_i. The coordinates of the projected point y_i are generated through the solution of Equation (24), which bounds the y_i points to the convex hull of their neighbors. Each α_ij weights the impact of the neighbors on the y_i position, and when α_ij = 1/|V_i|, y_i becomes the centroid of V_i.
The solution of Equation (24) leads to the linear system LY = 0, where L_{n×n} is a Laplacian matrix (a matrix representation of a graph) whose elements are defined as l_ii = 1, l_ij = −α_ij if x_j ∈ V_i, and l_ij = 0 otherwise. Unfortunately, L carries no geometric information, and the system's solutions can be useless. This issue is handled through the insertion of new rows in L containing the geometric information of the projected control points, turning the Laplacian system into a new non-homogeneous system AY = b, where A stacks L on top of a matrix S_{s×n} that indicates the control points, and b is the vector holding their projected coordinates. LSP computes the system solution with the least squares method [68], minimizing ‖AY − b‖^2, since an exact solution is unlikely. The final system is sparse and symmetric, which facilitates an iterative solution [9].
LSP presents a good level of neighbor preservation, but the system has n variables and is solved with O(n^2) complexity, which is infeasible for huge datasets [6]. Paulovich et al. [71] then created the Piecewise Laplacian-based Projection (PLP) to mitigate this complexity issue. PLP splits the dataset and maps the subsets with LSP. Spatial coherence is achieved through the initial projection of the control points, and the computing time is reduced because PLP solves small systems instead of a big and more complex one. However, this solution is often less accurate than LSP [6].

Part-Linear Multidimensional Projection
Part-Linear Multidimensional Projection (PLMP) [72] is a global and partially linear multidimensional projection method that can deal with huge amounts of data due to its low complexity level. It consists of two steps: (i) a nonlinear step that projects a subset of representative samples; (ii) a linear step that projects the remaining instances with a linear transformation, justifying the "part-linear" name.
PLMP selects s control points (see Section 4.8), a subset of representative instances of the dataset X that are projected with the Force Scheme, generating the low-dimensional set Ȳ. In the next step, a linear transformation Φ : R^m → R^p minimizes the differences between the projected distances and the original dissimilarities for x_i, x_j ∈ X, with L_{m,p} being the space of linear transformations that map R^m to R^p. Unfortunately, Equation (27) is computationally impractical for large datasets. To mitigate this issue, PLMP determines an approximation Φ̃, which takes its information from the subset X̄ of projected control points. Consequently, the final solution is a linear mapping that approximates the transformation of the control points [72].
To compute the approximation, PLMP creates the system X̄_{s×m} Φ̃_{m×p} = Ȳ_{s×p}, where X̄ and Ȳ are, respectively, the control points and their projections. Each column of Φ̃_{m×p} generates a sub-system, and least squares is applied to solve it. The projection of the remaining instances is then achieved by applying Φ̃ to the whole dataset (Equation (27)), that is, XΦ̃ = Y, a simple matrix multiplication with O(n) computational complexity.
As in the aforementioned techniques, users can manually define the position of the projected control points. Nonetheless, PLMP requires a number of control points larger than the number of data attributes, that is, s > m, to build a good-quality layout, and this characteristic can limit its use with very high-dimensional datasets, such as text represented with a bag of words.
The results presented by Paulovich et al. [72] highlighted that PLMP is faster than projections such as LSP. Moreover, besides the initial step, PLMP formulation does not need data dissimilarities information, reducing computational complexity.
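PLMP's second step — fitting Φ̃ from the control points by least squares and projecting everything with one matrix multiplication — can be sketched as follows. This is an illustrative NumPy version with hypothetical names; for brevity the control points are laid out with PCA, whereas the paper uses the Force Scheme:

```python
import numpy as np

def plmp(X, ctrl_idx, Y_ctrl):
    """Illustrative PLMP: fit a linear map from the control points, then
    project the whole dataset with a single matrix multiplication."""
    Xc = X[ctrl_idx]                                   # s x m control points
    Phi, *_ = np.linalg.lstsq(Xc, Y_ctrl, rcond=None)  # solve Xc @ Phi ~= Y_ctrl
    return X @ Phi                                     # O(n) projection of everything

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
ctrl = rng.choice(200, size=30, replace=False)         # s > m, as PLMP requires
# control-point layout via PCA (a stand-in for the Force Scheme)
Xc = X[ctrl] - X[ctrl].mean(0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Y_ctrl = U[:, :2] * S[:2]
Y = plmp(X, ctrl, Y_ctrl)
print(Y.shape)  # (200, 2)
```

The s > m requirement discussed above corresponds to the least-squares system being overdetermined.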

Local Affine Multidimensional Projection
Following Section 3.4, projection techniques that apply linear transformations are based on a single function that maps the high-dimensional data into the visual space. Such a function can be too generic to deal with all data characteristics, even when it is optimized to treat each data subset, as in PLMP (see Section 4.9). On the other hand, most nonlinear transformations depend on dissimilarity matrices whose calculation demands high computational complexity. Some techniques project representative instances to guide the process and overcome the complexity-related issues, but this demands the choice of several samples to execute the initial projection.
In this context, Joia et al. [12] developed the Local Affine Multidimensional Projection (LAMP). This method also uses representative instances (control points) to guide the projection in the first step. It then maps the remaining instances locally through interpolation using a family of orthogonal affine mappings: one affine function for each instance to be projected.
More specifically, LAMP executes a process similar to LSP (see Section 4.8), where s representative samples are randomly selected from a dataset X and projected onto the visual space with the Force Scheme (see Section 4.6). With this initial information from the control points' positions, each remaining instance x ∈ X is mapped with a linear function f_x(p) = pM + t, where the matrix M and the vector t are unknowns (by taking the partial derivatives with respect to t equal to zero, one can write t in terms of M) that minimize

Σ_i α_i ‖x̄_i M + t − ȳ_i‖²,

with x̄_i and ȳ_i being, respectively, a control point and its projection, α_i = 1/‖x̄_i − x‖² a weight that controls how much each control point influences the instance being projected, and α = Σ_i α_i. Finally, the matrix M is computed from the product of the left and right singular vectors derived from the SVD of a matrix built from the weighted, centered control points and their projections (for further information, please see Section 3.1 in [12]). Consequently, the more similar a control point is to x, the more it affects Equation (28), which reinforces the local characteristics of the LAMP process.
Following Joia et al. [12], the orthogonality restriction imposed on Equation (28) guarantees an isometric transformation, avoiding scaling and shear effects. Moreover, the restriction reduced the high error levels generated during the projection of control points and propagated during the other transformations, keeping the (inevitable) distortions as small as possible.
The authors rewrote Equation (28) in matrix notation, obtaining an Orthogonal Procrustes problem [73], a well-known matrix approximation problem: minimize ‖AM − B‖_F such that MᵀM = I, where ‖·‖_F is the Frobenius norm and the rows of A and B hold the weighted, centered control points and their projections. The matrix equation is solved using SVD: taking AᵀB = UDV, then M = UV, which demands O(s) operations. As a result, the position of each data point is obtained with y = (x − x̃)M + ỹ, where x̃ = ∑_i α_i x_i / α and ỹ = ∑_i α_i y_i / α are the weighted centroids of the control points and of their projections. LAMP was classified as a hybrid projection because its behavior can be local or global depending on the number of control points [6], a positive correlation: the more control points, the more global the mapping. Joia et al. [12] performed tests revealing that LAMP generates high-quality layouts even with a small number of control points. This formulation makes LAMP efficient, with O(sn) computational complexity.
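As a minimal sketch (not the authors' implementation), the per-instance solve above can be written with NumPy: weighted centroids, the SVD-based Orthogonal Procrustes solution, and the final affine mapping. The function name and the random control points are illustrative.

```python
import numpy as np

def lamp_point(x, X_ctrl, Y_ctrl, eps=1e-9):
    """Sketch of one LAMP step for a single instance x (hypothetical helper)."""
    # Weights alpha_i = 1 / ||x_i - x||^2: control points near x dominate.
    alpha = 1.0 / (np.sum((X_ctrl - x) ** 2, axis=1) + eps)
    a = alpha.sum()
    x_tilde = (alpha[:, None] * X_ctrl).sum(axis=0) / a   # weighted centroid in R^m
    y_tilde = (alpha[:, None] * Y_ctrl).sum(axis=0) / a   # weighted centroid in R^2
    # Weighted, centered systems A and B of the Procrustes problem ||AM - B||_F.
    A = np.sqrt(alpha)[:, None] * (X_ctrl - x_tilde)
    B = np.sqrt(alpha)[:, None] * (Y_ctrl - y_tilde)
    # Orthogonal solution: A^T B = U D V  =>  M = U V.
    U, _, Vt = np.linalg.svd(A.T @ B, full_matrices=False)
    M = U @ Vt
    return (x - x_tilde) @ M + y_tilde                    # final affine mapping

rng = np.random.default_rng(0)
X_ctrl = rng.normal(size=(10, 5))   # control points in R^5
Y_ctrl = rng.normal(size=(10, 2))   # their (pre-computed) 2D positions
y = lamp_point(rng.normal(size=5), X_ctrl, Y_ctrl)
print(y.shape)  # (2,)
```

Note that, by construction, an instance coinciding with a control point is mapped (up to the eps regularization) onto that control point's projection, which is the local behavior the weights α_i enforce.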

Hierarchical Approaches
Hierarchical Point Placement Strategy (HiPP) is a projection technique built to preserve the clustering and segregation of a dataset [10]. It generates a hierarchical structure that allows exploration at several levels of detail. With this asset, HiPP can be applied to larger datasets than several other projections. The authors reported that the whole process has a computational complexity of O(n√n), where n is the total number of data instances. The authors divided the HiPP process into three stages. In the first one, HiPP creates a tree that has all data instances as children of the root node. Using Bisecting k-means [74], the children's level is split into k clusters, and one node representing each created cluster is inserted as a child of the ancestor node. After this, all cluster items are connected to their related group node. This process is recursively applied to each new node until a global threshold of items inside each node is reached. Paulovich and Minghim [10] used a threshold equal to the square root of the number of dataset instances, √n, and a local k = √m, with m being the number of items inside the node (group) being divided.
In the second stage, the authors applied LSP to project the tree nodes. They represented groups as circles, with each circle's center placed at the group center and its size proportional to the number of items in the group. Thus, the first level is projected with these attributes, placing the circles on the plane. Next, internal data are projected when one interacts with a circle (group), zooming in and out on the structure.
In the third stage, a spreading algorithm removes overlaps in the visual layout. A vector between every two nodes is calculated and used to spread them: the nodes are displaced along the vector's direction but in opposite senses.
Furthermore, the authors used HiPP to explore text datasets. Thus, they present topic terms as labels for each group and color circles based on topics. The technique also allows splitting and reassembling groups interactively. HiPP is suitable when one needs to tailor clustering. On the other hand, it offers no flexibility regarding clustering techniques, projection techniques, or data types.
To enhance HiPP's capabilities, Dias and Minghim [75] proposed the eXtended Hierarchical Point Placement Strategy (xHiPP). First and foremost, the authors use three clustering techniques in the tree construction process: k-means, k-medoids, and a basic Hierarchical Agglomerative Clustering. Instead of using √n to decide how many groups will be generated, the authors used Sturges' Rule [76], which defines k = 1 + 3.3 × log(n). It results in a smaller number of clusters than the previous strategy, which is better when one works with a large dataset. In addition, the user can also modify this parameter.
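A quick comparison of the two splitting rules (a sketch; the function names are ours, and Sturges' Rule uses the base-10 logarithm):

```python
import math

def hipp_k(m):
    """Original HiPP: number of clusters per split is the square root of the group size."""
    return round(math.sqrt(m))

def xhipp_k(n):
    """xHiPP: Sturges' Rule, k = 1 + 3.3 * log10(n)."""
    return round(1 + 3.3 * math.log10(n))

for n in (100, 10_000, 1_000_000):
    print(n, hipp_k(n), xhipp_k(n))  # Sturges grows far slower than sqrt(n)
```

For a million instances, √n yields 1000 groups per split, whereas Sturges' Rule yields about 21, which is why the latter scales better on large datasets.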
In the second change proposed, the user can choose which projection best fits the data under analysis. It is possible to choose among LSP, MDS, Force Scheme, t-SNE, PLMP, LAMP, or PCA, all previously described in this review. To improve performance, Force Scheme is always used to project internal data.
Dias and Minghim [75] allowed xHiPP to cluster data before the projection stage (as in the original approach) and also to project first, followed by a clustering stage. The latter takes advantage of the projection techniques' ability to bring similar data points together in the projected space, partitioning the dataset with high precision.
Beyond the text exploration tools implemented in the original HiPP, the authors added word clouds to summarize the content of both groups and individual documents. They also added functionalities to facilitate the exploration of image, audio, and general datasets: group medoids are used to represent image sets, spectrograms represent audio ones, and heatmap images map the attribute distributions of general data. If the data are labeled, colors represent the predominant labels of nodes and groups.
According to the authors, these changes and additions make it possible for the user to combine different methods and find the set that best represents the data under analysis.

Local Convex Hull
The Local Convex Hull (LoCH) [6] is a local technique built to project data embedded in sparse multidimensional spaces through three steps. In the first step, LoCH generates √n data clusters and finds their medoids (similar to the LSP control points seen in Section 4.8); these medoids are used to calculate distances among clusters. For each data instance x_i, the projection creates a k-neighborhood considering the data inside the s clusters closest to x_i.
In the second step, the clusters' medoids are projected with Force Scheme [26]. According to the tests of Fadel et al. [6], Force Scheme preserves global distance relationships and produces high-precision results.
In the final step, the remaining instances x_i are projected by applying an interpolation that considers the previously projected points and the distance from x_i to the convex hull of the k-neighborhood generated in the first step. Based on this idea, the position of x_i in the projected space is determined by the convex combination ŷ_i = ∑_j α_j y_j of its neighbors' projections. With this formulation, the distance relationships are preserved; however, it does not guarantee that ŷ_i is placed inside the correct convex hull. Therefore, the initial ŷ_i is iteratively moved toward a target point ỹ_i, where γ_i acts as a preservation factor of the original distances between x_i and its neighborhood N_i. This process needs about √n iterations to approximate y_i and the convex hull [6]. LoCH is simple to implement and presents a low computational complexity, namely O(n√n). Unfortunately, LoCH's characteristics (well suited to sparse multidimensional datasets) limit its application to a narrower range of problems.

Uniform Manifold Approximation for Dimension Reduction
Uniform Manifold Approximation and Projection (UMAP) [29] is a multidimensional projection technique with a solid base in geometry and topology theories. UMAP is a manifold learning method that relies on Riemannian geometry and algebraic topology to perform dimensionality reduction [77]. The authors highlighted that UMAP is scalable, which makes it suitable for real applications with large datasets. Furthermore, UMAP presents results as good as t-SNE (currently one of the most widely used projection techniques) in visualization quality; nonetheless, it better preserves global data structures and is faster than t-SNE. In addition, UMAP was developed with machine learning applications in mind [8].
To perform data mapping, UMAP considers the following assumptions: (i) data are uniformly distributed on a manifold, and (ii) the manifold is locally connected, i.e., there are no isolated points. McInnes et al. [29] used manifold approximation and patching associated with local fuzzy simplicial sets [78] to construct topological representations of the original and projected spaces. Following this step, UMAP minimizes the cross-entropy between the original topology and the projected one.
The authors presented both the mathematical bases and a practical, computational view of the process. In the latter, UMAP is described as a sequence of steps that construct and manipulate a weighted graph. The first step creates a directed graph whose weight function is w(x_i, x_j) = exp(−max(0, d(x_i, x_j) − ρ_i)/σ_i), where ρ_i is the distance from x_i to its nearest neighbor and σ_i is calculated to satisfy ∑_j exp(−max(0, d(x_i, x_j) − ρ_i)/σ_i) = log₂(k), with the sum taken over the k nearest neighbors of x_i. These formulations consider a parameter k and some dissimilarity function d. Thus, the undirected graph G used by UMAP has adjacency matrix B = A + Aᵀ − A ∘ Aᵀ, where A is the matrix generated with the previous weight function and ∘ is the Hadamard product [79].
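A minimal sketch of these two formulas (not the reference implementation; the umap-learn package solves the same equation with its own smoothed-kNN routine): a binary search for σ_i, followed by the fuzzy-union symmetrization of the weight matrix.

```python
import numpy as np

def smooth_sigma(dists, k, n_iter=64):
    """Binary-search sigma_i so that sum_j exp(-max(0, d_ij - rho_i)/sigma_i)
    equals log2(k), as in UMAP's local fuzzy neighborhood (a sketch)."""
    rho = dists.min()                      # distance to the nearest neighbor
    target = np.log2(k)
    lo, hi = 1e-12, 1e6
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        s = np.exp(-np.maximum(dists - rho, 0.0) / mid).sum()
        if s > target:                     # sum grows monotonically with sigma
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Symmetrization of the directed weights: B = A + A^T - A ∘ A^T (fuzzy union).
A = np.array([[0.0, 0.9], [0.4, 0.0]])
B = A + A.T - A * A.T
print(B)  # now symmetric
```

The binary search converges quickly because the left-hand sum is monotonically increasing in σ_i, ranging from 1 (the ρ_i term alone) up to k.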
UMAP applies a force-directed approach to place the graph nodes in the projected space. The algorithm uses attraction and repulsion forces between projected points y_i and y_j, defined, respectively, as

(−2ab ‖y_i − y_j‖₂^{2(b−1)} / (1 + ‖y_i − y_j‖₂²)) w(x_i, x_j) (y_i − y_j)  and
(2b / ((ε + ‖y_i − y_j‖₂²)(1 + a‖y_i − y_j‖₂^{2b}))) (1 − w(x_i, x_j)) (y_i − y_j),

where a and b are hyperparameters and ε is a small number used to avoid division by zero (the authors used ε = 0.001). McInnes et al. [29] highlighted that the initial projection could be randomly generated; however, they used a spectral layout to initialize it, which provides faster convergence and superior algorithm stability. In the last projection step, UMAP minimizes with gradient descent a smooth approximation of the membership strength between two projected points, defined as Φ(x, y) = (1 + a(‖x − y‖₂²)^b)^{−1}. The parameters a and b are defined by a nonlinear least-squares fit of a function that depends on the positions of the points and on a parameter that defines the desired separation of the points in the transformed space.
As stated by its authors, UMAP provides visual results highly similar to t-SNE's and, for this reason, is often seen as a modern, faster, and more scalable alternative to t-SNE [56]. Figure 3 depicts a comparison between UMAP and t-SNE applied to well-known datasets, namely COIL20 [80], MNIST [81], Fashion MNIST [82], and Google News [83]. One can see that UMAP's results are comparable to t-SNE's, segregating and maintaining the same data structures.

TopoMap
Following the topological approach, Doraiswamy et al. [56] developed TopoMap, a projection method devoted to mapping data from a high-dimensional space to a visual space preserving the global 0-dimensional topological persistence diagram defined by a Rips filtration over the m-dimensional data. Furthermore, TopoMap ensures that Rips filtrations generate the same connected components when applied to the original and the projected data by preserving the minimal spanning tree (MST) of the original data embedding in the projection [84].
Geometric relations, such as distances or proximity relationships between data instances, are not the only attractive properties to be preserved in a projection. According to TopoMap's authors, particular structures such as clusters and outliers can be more reliable and meaningful if the mapping reveals topological invariants such as connected components and loops.
In contrast to dissimilarity-based projections, TopoMap is based on the non-differentiable 0-homology topological persistence, i.e., the evolution of cycles in the simplicial complexes resulting from a Euclidean distance-based Rips filtration [85]. Given a set of m-dimensional data instances P, TopoMap builds a complete weighted graph over P, weighting each edge (p_i, p_j) with the Euclidean distance between the corresponding endpoints, d(p_i, p_j). The Rips filtration [86] grows a high-dimensional ball around each data instance, adding an edge (simplex) to the filtration, one at a time, whenever two (or more) balls intersect. The addition of each new simplex can change the topology. The ordered set of topology-changing edges is precisely the MST computed over the graph. TopoMap projects the data while preserving the evolution of cycles, such that the topological filtrations over the high-dimensional and projected data generate the same connected components (0-cycles) at the same instants of the respective filtrations.
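The core computation described above — the MST of the complete Euclidean graph, whose edges, sorted by length, give the component-merge events of the Rips filtration — can be sketched with SciPy (the random point cloud is illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(42)
P = rng.normal(size=(30, 8))             # 30 instances in R^8
D = squareform(pdist(P))                 # dense Euclidean distance matrix, O(n^2)
mst = minimum_spanning_tree(D)           # sparse matrix holding the MST edges
merge_lengths = np.sort(mst.data)        # 0-cycle merge "times" of the filtration
print(len(merge_lengths))                # n - 1 merges for n points
```

This also makes the complexity remark below concrete: with a non-Euclidean metric, the full n × n distance matrix must be materialized before the MST can be extracted.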
TopoMap is a different approach from Isomap and UMAP: Isomap does not consider 0-homology groups, which are therefore not preserved, and UMAP is based on category theory [29], whereas TopoMap focuses on persistent homology. Topological methods have been used to evaluate the projection process [87]. Since TopoMap bears theoretical guarantees, a side contribution is that it can be applied as an analytical tool, helping to probe and illustrate how other projection methods behave, especially regarding distortions and cluster preservation. In this context, Doraiswamy et al. [56] demonstrated that clusters visualized in a t-SNE layout, in fact, tend to correspond only to pieces of clusters present in the m-dimensional data.
TopoMap can be used with a distance metric other than the Euclidean distance. However, changing the metric requires computing the MST under the new metric, which degenerates TopoMap's running time to O(n²) due to the computation of the distance matrix.

Graph Regularization Multidimensional Projection
The recently introduced Graph Regularization Multidimensional Projection (GRMP) [77] generates two-dimensional (2D) visualizations that reproduce the groups present in the high-dimensional space while preserving distance relationships between dataset instances. Based on a similarity graph, GRMP uses Graph Signal Processing theory (GSP) [88] and graph regularization as fundamental tools to incorporate the groups' information into the projected space.
A graph is a structure that represents data instances (vertices) and their relationships (edges). Signal theory adds extra information to the graph, allowing a signal to be processed through the graph structure. In this context, graph regularization is based on signal smoothing by solving a minimization problem, which is equivalent to applying a low-pass spectral filter [77].
The GRMP method can be divided into three main steps. The first step is the construction of the similarity graph: similar to the previous topological-based methods, GRMP builds a similarity graph from the m-dimensional dataset, taking the data instances as the graph's vertex set and establishing a one-to-one correspondence between instances and vertices. Every two vertices are connected according to the similarity relationship between their corresponding instances. In the second step, GRMP defines the coordinate graph signals in the 2D space through the spreading control given by a phyllotactic distribution. Note that the method first spreads all points apart using this distribution and only approximates the points that should be close in the final step, using regularization. In the third and last step, the coordinates are treated as graph signals to be processed and regularized using GSP tools.
The GSP application is motivated by the fact that neighboring points in the 2D space tend to have similar coordinates in the high-dimensional space. The smoothing effect of the graph regularization resets the coordinates of the points given by the phyllotactic distribution. The final GRMP projection is allocated in the visual space using control points selected from the medoids of each component. The control points are projected using MDS or another multidimensional projection technique that preserves distances. Then, the smoothed components are translated and positioned, matching their control points' projected counterparts.
GRMP uses the k-NN graph [89] to build the similarity graph, which means GRMP inherits its main properties and its difficulties in dealing with outliers. Parameterization is simple in GRMP, comprising essentially two parameters: the number of neighbors and the graph regularization parameter. However, the authors based their parameter choices on empirical tests, pointing to the need for fine-tuning to achieve better groupings.

SHAP Clustering
A fundamental challenge in the multidimensional projection domain is related to metric spaces. As we have seen throughout this section, most projection methods rely on the concept of distance to build similarity relationships between instances when embedding the high-dimensional space into the visual one. However, in projection and clustering tasks, we often need to handle data whose features bear very different units, i.e., features may be unitless scores, grams, meters, and so on. Using features with different units as dimensions of a single multidimensional space forces any distance metric to compare relative relationships across different units (grams vs. meters, for example) [90].
To tackle this issue, Lundberg et al. [90] applied SHAP values to convert the input features into importance values sharing the same units. SHAP (SHapley Additive exPlanations) is a computationally efficient way to verify the relationship between input features and the output of supervised machine learning methods. Specifically, SHAP is a model-agnostic feature importance method applied in the context of Explainable Artificial Intelligence (XAI) [91] to explain machine learning predictions through effect measurements reflecting the importance of input features. It is based on Shapley values [92], a cost-sharing solution from classical cooperative game theory that allocates to each feature a fair importance value for a particular prediction. The explicit definition is

φ_i = ∑_{S ⊆ {1, …, m} \ {i}} (|S|! (m − |S| − 1)! / m!) ( f(S ∪ {i}) − f(S) ),

where S is a subset of the m features, and f(S ∪ {i}) and f(S) represent a mapping function (a trained machine learning model, for example) evaluated with feature i present and withheld, respectively. Thus, the Shapley value of a feature i can be understood as a weighted average of i's marginal contributions to every possible subset of features [93], which means the exact computation of Shapley values is an NP-hard problem, unfeasible over high-dimensional data. Hence, Lundberg and Lee [94] proposed SHAP as an approximation of classical Shapley values based on statistical sampling. SHAP is currently a state-of-the-art XAI method used to explain machine learning predictions in the most varied domains, ranging from aerospace to medicine and the social sciences [95][96][97]. Lundberg et al. [90] extended SHAP to the unsupervised learning context, presenting SHAP Clustering. SHAP Clustering converts all input features into SHAP values, unitless measures of feature importance. This is useful because, even when all original inputs are in the same units, some of them are often more important than others.
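The definition above can be computed exactly for toy problems by brute force (a sketch; the subset enumeration is exponential in m, which is exactly why SHAP resorts to approximation on real data):

```python
import math
from itertools import combinations

def shapley(f, m):
    """Exact Shapley values by enumerating all subsets S of {0..m-1}\\{i}."""
    phi = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        val = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # weight |S|! (m - |S| - 1)! / m! of the marginal contribution
                w = math.factorial(len(S)) * math.factorial(m - len(S) - 1) / math.factorial(m)
                val += w * (f(set(S) | {i}) - f(set(S)))
        phi.append(val)
    return phi

# Toy additive "model": f(S) = sum of fixed contributions of present features.
contrib = {0: 2.0, 1: -1.0, 2: 0.5}
f = lambda S: sum(contrib[j] for j in S)
print(shapley(f, 3))  # an additive model recovers its own contributions
```

Two properties worth checking against the formula: for an additive model, each φ_i equals the feature's own contribution, and the φ_i always sum to f of the full feature set minus f of the empty set (efficiency).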
SHAP Clustering also introduces the concept of "supervised clustering" as a consequence of converting input features into SHAP values to obtain their effect information. In other words, SHAP values are log-odds attributing the impact of each feature to a machine learning model's output; hence, fluctuations in the feature values only affect the clustering if those fluctuations are significant to the outcome [90]. SHAP Clustering differs significantly from the previous multidimensional projection literature by applying concepts from game theory, positioning itself as an interesting and little-explored alternative. Moreover, SHAP Clustering handles a challenging issue in the multidimensional projection domain by allowing direct comparisons between the relative importance of features bearing the most different units. Similarly, a recent approach used Shapley values as a tool to provide explanations regarding important features of the clustering process of multidimensional projection methods [98].
In the next section, we will discuss a comparative summary of the approaches reviewed in this section.

Discussions about the Multidimensional Projection Techniques
A challenging issue is the balance between projection quality and the amount of processing required to generate good mappings. High computational costs virtually block the application of some projection techniques, even the accurate ones, to medium-sized datasets. As we saw above, an alternative strategy employed to mitigate this computational bound and enable the use of multidimensional projections, even in interactive and real-time applications, is the sampling of representative instances. From a subset of samples, it is possible to apply a precise (and more expensive) projection technique to generate the control points, which then guide the mapping of the remaining instances (now using a less expensive technique) into the visual space.
Although the original purpose of control points is to reduce computational time, they can also be applied to incorporate some degree of user interactivity into projection processes [64]. However, there are still no comprehensive studies quantifying the impact of user manipulation of control points, either in terms of quality or reasonability of the generated maps [7]. Furthermore, the proper choice of representative samples is a central question in some circumstances. For example, authors such as Paulovich et al. [9], Paulovich and Minghim [10], Joia et al. [12], and Fadel et al. [6] suggested a random selection of large quantities of instances due to the difficulty of determining an ideal amount that effectively represents all the classes contained in the dataset, but exactly how large this quantity must be remains an open issue.
The larger the representative subset, the greater the computational effort required in the initial projection stage. Although it is statistically possible to infer that a random sample contains significant information from the dataset, there is no guarantee that this will always be true. Furthermore, random selection introduces an undesirable effect: the same dataset is mapped into different layouts, which can confuse users. Thus, there is a need for alternatives that reduce the complexity of multidimensional projection techniques while producing effective and deterministic layouts in reduced time.

Figure 4 presents a performance comparison between some of the state-of-the-art multidimensional projection techniques. According to Fadel et al. [6], global techniques usually yield better dissimilarity preservation averages (stress) and computational performance. Note that LAMP, which is essentially a local method, is highlighted as offering one of the best balances between stress (see Figure 4, left) and computational performance (see Figure 4, right).

Table 1 summarizes the main approaches reviewed throughout the text. In addition to these techniques, there are other derivations, modifications, and improvements. Maaten et al. [16] also carried out a valuable review of the subject; however, interest in developing and improving multidimensional projection strategies has been intense over the past ten years. In this scenario, Nonato and Aupetit [7] carried out comprehensive and updated research on multidimensional projection methods, clarifying limitations and strengths from the perspective of visual analytics. As we demonstrate here, multidimensional projection has a large field of application possibilities. To finish this Discussion section, we highlight one emerging application in an effervescent new research domain: Explainable Artificial Intelligence (XAI).
Artificial Intelligence, especially Machine Learning, has attracted significantly growing development interest over the last decade. On the one hand, Machine Learning models such as those based on Artificial Neural Networks and Gradient Boosting can outperform human capabilities in many tasks and, for this reason, have often been applied in decision-making processes [99,100]. On the other hand, these kinds of models are based on highly complex structures, which makes them "black boxes" virtually impossible to interpret and raises several issues regarding trust in their predictions when applied to sensitive contexts, e.g., medicine, finance, autonomous cars, and so on [91].
XAI methods are designed to "open" the Machine Learning black boxes, providing information to justify and understand the working logic of these algorithms. An interesting and little-explored use of multidimensional projections is in the XAI context. We can cite applications of t-SNE as part of XAI proposals aiming to cluster images according to neuron activations in networks [101], to visualize latent variables [102], and to cluster genetic variables [103], besides the above-mentioned SHAP Clustering. XAI is a challenging subject, and we argue that visualization approaches such as multidimensional projections can improve explainability systems because visualization can produce highly informative and human-centered explanations [104]. The interested reader can consult the surveys of Arrieta et al. [91] and Adadi and Berrada [104] for a detailed view of XAI theory.

Real Data Applications
In order to illustrate the performance of some of the multidimensional projection algorithms presented, two different applications were chosen: a clothing E-Commerce text mining task and a DNA microarray cancer classification task. It is essential to mention that some embedding techniques are based on distance metrics, such as Euclidean, Manhattan, Cosine, Minkowski, Jaccard, and Chebyshev. Nonetheless, for illustration, we adopted only the Euclidean distance for Classical MDS, Force Scheme, LAMP, t-SNE, and UMAP; these were compared to PCA, which uses the observation values themselves. The first example was solved using the R software [105] and the second using Python [106].
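For intuition, the metrics mentioned above can disagree substantially on the same pair of points, which changes the neighborhoods — and therefore the layout — a projection produces. A sketch with SciPy (note that `cityblock` is SciPy's name for the Manhattan distance):

```python
import numpy as np
from scipy.spatial.distance import cdist

# The same pair of 3D points under several of the metrics listed above.
X = np.array([[1.0, 0.0, 2.0]])
Y = np.array([[0.0, 1.0, 0.0]])
for metric in ("euclidean", "cityblock", "cosine", "minkowski", "chebyshev"):
    print(metric, cdist(X, Y, metric=metric)[0, 0])
```

Euclidean gives √6 ≈ 2.449, Manhattan gives 4, and Chebyshev gives 2 for this pair, so the ranking of neighbors can differ from metric to metric.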

Text Mining Data
The first motivation was adopted from the Kaggle website and is related to the Women's Clothing E-Commerce dataset. These data are anonymized reviews recorded from customers; the dataset has ten features (Clothing ID, Reviewer's Age, Title, Review Text, Rating, Recommended IND, Positive Feedback Count, Division Name, Department Name, and Class Name). Further details can be found at [107].
After pre-processing steps such as removing numbers, punctuation, and stopwords and transforming all text into lowercase, the frequency of the terms was calculated. Figure 5 shows the description of the Review Text feature, a suitable analysis that helps verify the importance of words and their relation to the recommendations. In addition to these analyses, multidimensional projection algorithms can cluster similar observations embedded in 2D. Figure 6 exemplifies this considering the reviews from the Jacket item only, adopting the algorithms PCA, MDS, Force Scheme, LAMP, t-SNE, and UMAP, where blue words are used in positive comments and red words in negative ones. These engines are helpful for monitoring and tracing characteristics that can maximize the clothing store's ratings. In this case, it is challenging to identify clusters separating positive and negative comments; thus, other types of features could be considered to represent that segregation. On the other hand, one can analyze isolated clothing classes to determine the segregation between positive and negative comments. For instance, by selecting the PCA visualization and adopting the package factoextra, the text reviews can be seen through their similarity (Figure 7). Selecting the extreme values on the top right (ID #8551) versus the bottom right (ID #12374), it is easy to detect that we have a positive review versus a negative one. Moreover, the obtained patterns can unravel other characteristics across the formed clusters. Review ID #8551-"This brand makes the best jackets. I got a vest from them last year and an olive moto jacket. Both fit amazing. This one is no different. It fits great! I love that it is a great casual jacket with structure and drape. I cannot say enough good things about this Jacket. Every time I wear it, I feel so pulled together. Fun light gray color that you do not see too often in jackets.
This is a jacket I will see in my wardrobe for years to come. must-have".
Review ID #12374-"This is a very loose-fitting style jacket. If you like your jackets oversized, get your regular size; if you prefer the slightly loose (but not baggy) fit like the product photo, size down. As noted by another reviewer, the sleeves are incredibly long. I would say if you are a typical size 0 or 2 (unless you like the oversized fit), you are likely sized out of this Jacket. It is a shame because the quality is great, and the windowpane print is beautiful. I am sadly sized out".
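The pre-processing pipeline described above (lowercasing, stripping numbers and punctuation, removing stopwords, then counting term frequencies) can be sketched in a few lines. The stopword list and the sample review below are illustrative, not the ones used in the R analysis:

```python
import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "and", "it", "this", "i", "of", "in", "to"}  # toy list

def term_frequencies(text):
    """Lowercase, keep alphabetic tokens only, drop stopwords, count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())   # discards numbers/punctuation
    return Counter(t for t in tokens if t not in STOPWORDS)

review = "This brand makes the best jackets. It fits great! I love this jacket."
print(term_frequencies(review).most_common(3))
```

The resulting term-frequency vectors are what the distance-based projections (MDS, Force Scheme, LAMP, t-SNE, UMAP) consume in this example.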

Bio-Informatics (Cancer Classification Data)
Golub et al. [108] discussed gene expression monitoring (via DNA microarray) for patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The study targets identifying new cancer classes and assigning tumors to known classes while dealing with high-dimensional data. The dataset and further details can be found at [109]. The adopted dataset presents 7129 observations with 78 gene expressions (features). Figure 8 shows the correlation matrix across the 78 gene expressions. Moreover, Figure 9 presents six multidimensional projection techniques (PCA, MDS, Force Scheme, LAMP, t-SNE, UMAP), implemented in Python, embedding the characteristics in two dimensions. The specialist can combine the findings derived from all of them. One can say these projections are complementary, and, by using visual data mining, hidden patterns may be found.
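A hedged sketch of such a comparison pipeline with scikit-learn (the digits dataset stands in for the gene-expression matrix, and only three of the six techniques are shown, since Force Scheme, LAMP, and UMAP have no scikit-learn implementation):

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

# Standardize, then embed the same high-dimensional data with three projections.
X = StandardScaler().fit_transform(load_digits().data[:200])
embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "MDS": MDS(n_components=2, random_state=0).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X),
}
for name, Y in embeddings.items():
    print(name, Y.shape)  # each technique yields one 2D layout of the same data
```

Plotting the three layouts side by side (e.g., with matplotlib) reproduces the kind of comparison shown in Figure 9, where each technique highlights different aspects of the same structure.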

Challenges, Open Questions, and Future Research Directions
Given the relevance of visual analytics as a significant part of mining tools in data science applications, multidimensional projection techniques are a powerful resource due to their ability to generate visual representations of complex groupings of data instances belonging to high-dimensional spaces. However, significant gaps still need to be addressed in the multidimensional projection context.
The trade-off between computational costs and precision is one of the significant challenges regarding the development and application of multidimensional projections. Because of the increasingly large datasets we are facing nowadays, both in terms of the number of instances and the number of attributes per instance (dimensionality), the development of computationally efficient and accurate techniques is a challenging issue that involves research in sophisticated mathematical solutions. This challenge is intensified when applications need to handle real-time or growing datasets, such as time series and streaming data.
An open opportunity in the projection domain is handling time-varying data, i.e., how to project multidimensional time series. Research on multidimensional time series has a long and rich history [110], but properly exploring the time component imposes a major challenge in the projection domain. Though it is possible to project time slices of the data and analyze their evolution, this can easily become an overwhelming task when a long time series is considered. The literature is not so rich regarding multidimensional projection methods able to capture dynamically evolving data [111]. We can cite some extensions of well-known methods such as t-SNE [112][113][114] and even PCA [115,116].
Real-time data also highlight an issue already identified by the research community in multidimensional projections: stability. A stable method must be able to incorporate new data into the visual space without significant modifications in the positions of points already projected. In addition, in a scenario with continuously generated data, it might be too expensive to recalculate the entire projection whenever the number of instances increases or decreases. Methods that rely on matrix decomposition are not stable, since the addition or removal of a single instance demands the recalculation of all matrix operations, which can significantly change the mapping [7]. Optimization-based techniques depend on initial conditions, so there is no way to guarantee their stability.
Strategies based on control points are more resilient to perturbations, as long as the control points are kept fixed. However, these approaches raise other questions. As mentioned before, random sampling of control points has been a commonly applied solution, but there is no guarantee that a random sample encompasses all the nuances of a dataset (and different samples may produce different layouts, confusing users). Another issue is the sampling size: how many control points are needed to compose a good sample? These open questions await future research establishing metrics and parameters for generating the initial sample subsets.
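To make the control-point idea concrete, the following is a deliberately simplified, illustrative sketch (not any specific published technique): a random sample of control points is projected with metric MDS, and the remaining instances are placed by inverse-distance (Shepard) interpolation over the fixed control points. The sample size `k` below is an arbitrary assumption, which is precisely the open question raised above:

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 12))           # synthetic dataset, 12 attributes

# 1) Randomly sample control points -- the sample size is an open choice.
k = 20
ctrl_idx = rng.choice(len(X), size=k, replace=False)
ctrl_hi = X[ctrl_idx]
ctrl_lo = MDS(n_components=2, random_state=2).fit_transform(ctrl_hi)

# 2) Place every instance by inverse-squared-distance weighting over the
#    fixed control points; control points themselves stay (nearly) fixed,
#    which is what gives these strategies their resilience.
d = cdist(X, ctrl_hi)                    # distances to control points
w = 1.0 / (d ** 2 + 1e-9)                # Shepard interpolation weights
Y = (w @ ctrl_lo) / w.sum(axis=1, keepdims=True)
print(Y.shape)  # (200, 2)
```

Because the weights diverge as an instance approaches a control point, the control points map (up to numerical precision) onto their MDS positions, so adding new instances does not move them.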
Most multidimensional projection techniques were tailored to operate on numerical data. Thus, we identified a gap in the ability of projection tools to handle non-numerical values, such as ordinal, textual, and categorical data [117]. Some classic strategies can alleviate this deficiency: one-hot encoding, for example, long applied in digital circuits and machine learning environments, transforms categorical variables into numerical combinations of binary nature. However, work remains to properly deal with variables that are not essentially numerical.
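As a minimal illustration of the one-hot encoding strategy (the small table below is a made-up example), pandas turns each category into a binary column, yielding a fully numerical table that projection methods can consume:

```python
import pandas as pd

# Mixed dataset: one numerical attribute and one categorical attribute.
df = pd.DataFrame({
    "expression": [2.1, 0.7, 1.5, 3.3],
    "tissue": ["liver", "brain", "liver", "kidney"],
})

# One-hot encoding replaces the categorical column with one binary
# column per category, so distance computations become well-defined.
encoded = pd.get_dummies(df, columns=["tissue"])
print(list(encoded.columns))
# ['expression', 'tissue_brain', 'tissue_kidney', 'tissue_liver']
```

Note that this implicitly assumes all categories are equidistant, which is one reason one-hot encoding only partially closes the gap for ordinal or textual variables.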
One major challenge of multidimensional projection techniques, in our opinion, is interpretability. In practical terms, information is lost when a user with little or no additional training analyzes the result of a multidimensional projection mapping. Although some studies have already worked on improving the perceptual capabilities of scatter plots [118][119][120][121], merely presenting points or clouds of points in scatter plots does not effectively inform the user about the complex relationships between data instances. Which features of the m-dimensional space lead to the produced mapping in the p-dimensional space? There is a demand for layouts whose designs properly explore users' graphic perception abilities, enriched with statistical information, and for interactive tools that enable exploratory visual analysis. There is a clear need to enrich multidimensional projection mappings so that users can generate insights over the available information, transforming the perception of point distances in the visual space into the decoding of the original (dis)similarities between data instances.

Conclusions
In this study, we discussed the main multidimensional projection techniques and, whenever possible, their most applied derivations. Our taxonomy and formalization include definitions not only of the techniques covered in this research but also of seminal concepts about application domains, data pre-processing tasks, and the limitations that multidimensional projection techniques address. Furthermore, the ability to preserve data similarity while generating visual representations renders multidimensional projection techniques highly informative as a visualization mechanism that can enrich many problem-solving and decision-making applications [122,123]. Thus, we believe that multidimensional projection is an effervescent research area with considerable applicability potential, contributing to the existing exploratory tools in a wide range of domains within the data sciences. Nevertheless, many important issues remain open, and we have pointed out several scenarios and directions where multidimensional projection techniques can evolve in future work.