Dynamical Analysis of the Dow Jones Index Using Dimensionality Reduction and Visualization

Time-series generated by complex systems (CS) are often characterized by phenomena such as chaoticity, fractality and memory effects, which pose difficulties in their analysis. This paper explores the dynamics of multidimensional data generated by a CS. The Dow Jones Industrial Average (DJIA) index is selected as a test-bed. The DJIA time-series is normalized and segmented into several time window vectors. These vectors are treated as objects that characterize the DJIA dynamical behavior. The objects are then compared by means of different distances to generate proper inputs to dimensionality reduction and information visualization algorithms. These computational techniques produce meaningful representations of the original dataset according to the (dis)similarities between the objects. Time is displayed as a parametric variable, and non-local behavior can be visualized through the evolution of points and the formation of clusters. The generated portraits reveal a complex nature, which is further analyzed in terms of the emerging patterns. The results show that the adoption of dimensionality reduction and visualization tools for processing complex data is a key modeling option given the current computational resources.


Introduction
Complex systems (CS) are composed of several autonomous entities, described by simple rules, that interact with each other and their environment. CS give rise to collective behavior that exhibits much richer dynamical phenomena than those presented by the individual elements. Often, CS exhibit evolution, adaptation, self-organization, emergence of new orders and structures, long-range correlations in the time-space domain, chaoticity, fractality, and memory effects [1][2][3][4]. CS are pervasive not only in nature, but also in human-related activities, and include molecular dynamics, living organisms, ecosystems, celestial mechanics, financial markets, computational systems, transportation and social networks, and world and country economies, as well as many others [5][6][7][8].
Time-series analysis has been successfully adopted to study CS [9,10]. The CS outputs are measured over time and the data collected are interpreted as manifestations of the CS dynamics. Therefore, the study of the time-series allows for conclusions about the CS behavior to be reached [11,12]. Nonetheless, real-world time-series may be affected by noise, distortion and incompleteness, requiring advanced processing methods for the extraction of significant information from the data [13]. Information visualization plays a key role in time-series analysis, as it provides insight into the data characteristics. Information visualization corresponds to the computer generation of dataset visual representations. Its main goal is to expose features hidden in the data, in order to understand the system that generated such data [14,15]. Dimensionality reduction [16] plays a key role in information visualization, since the numerical data are often multidimensional. Dimensionality reduction-based schemes try to preserve, in lower-dimensional representations, the information present in the original datasets. They include linear methods, such as classic multidimensional scaling (MDS) [17], principal component [18], canonical correlation [19], linear discriminant [20] and factor analysis [21], as well as nonlinear approaches, such as non-classic MDS, or Sammon's projection [22], isomap [23], Laplacian eigenmap [24], diffusion map [25], t-distributed stochastic neighbor embedding (t-SNE) [26] and uniform manifold approximation and projection (UMAP) [27].
Financial time-series have a complex nature and their dynamic characterization is challenging. The Dow Jones Industrial Average (DJIA) is an important financial index and is adopted in this paper as a dataset generated by a CS. The paper explores an alternative strategy to the classical time-domain analysis, by combining the concepts of distance and dimensionality reduction with computational visualization tools. The DJIA time-series of daily close values is normalized and segmented, yielding a number of objects that characterize the DJIA dynamics. These objects are vectors, whose time-length and partial time-overlap represent a compromise between time resolution and memory length. The objects are compared using various distances and their dissimilarities are used as the input to different dimensionality reduction and information visualization algorithms, namely hierarchical clustering (HC), MDS, t-SNE and UMAP. The aforementioned algorithms construct representations of the original dataset, where time is a parametric variable. The structure of the plots is further analyzed in terms of the emerging patterns. The formation of clusters and the evolution of the patterns over time map a dynamical behavior with discontinuities in periods where memory is somehow lost. Numerical experiments illustrate the feasibility and effectiveness of the method for the processing of complex data.
The paper organization is summarized as follows. Section 2 reviews fundamental mathematical concepts, namely the distances and the algorithms adopted in the study for processing and visualizing data. Section 3 introduces the DJIA dataset. Section 4 analyses the data and interprets the results in the light of the distances used. Section 5 assesses the effect of the time-length and overlap of the segmenting window. Section 6 presents the conclusions.

Distances
Given two points v_i and v_j in a set X, the function d(v_i, v_j) : X × X → [0, +∞) represents a distance between the points if, and only if, it satisfies the conditions of identity of indiscernibles, symmetry and triangle inequality [28].
Other techniques [29] and distances [28] can also be adopted to compare the data. However, a more extensive overview and utilization of a larger number of alternatives are beyond the scope of this paper.
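Since expressions (1)-(10) are not reproduced in this section, a minimal sketch of two representative distances may help fix ideas: the Euclidean distance and the Canberra distance (the latter referred to as d_2 in Section 5). This is an illustrative NumPy fragment, not the paper's own implementation.

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance between two window vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

def canberra(u, v):
    """Canberra distance: sum of |u_k - v_k| / (|u_k| + |v_k|),
    with 0/0 terms taken as 0."""
    num = np.abs(u - v)
    den = np.abs(u) + np.abs(v)
    mask = den > 0
    return float(np.sum(num[mask] / den[mask]))
```

Both functions satisfy the metric conditions stated above; the Canberra distance additionally weights deviations by the magnitude of the components, which makes it sensitive to differences near zero.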

Dimensionality Reduction and Visualization
In the next subsections, the dimensionality reduction and visualization techniques that are adopted for data processing are presented, namely the HC, MDS, t-SNE and UMAP.
Given a set of N objects, v i , i = 1, . . . , N, in space P, all methods require the definition of a distance d(v i , v j ), i, j = 1, . . . , N, between the objects i and j.

The Hierarchical Clustering
The HC groups similar objects and represents them in a 2-dim locus. The algorithm involves two main steps [30]. In the first, the HC constructs the matrix of distances D = [d(v_i, v_j)], i, j = 1, ..., N. In the second step, the algorithm arranges the objects in a hierarchical structure and depicts them in a graphical portrait, namely a hierarchical tree or a dendrogram. This is achieved by means of two alternative techniques: the divisive and the agglomerative procedures. In the divisive scheme, all objects start in one single cluster and end in separate clusters. This is done by iteratively removing the 'outsiders' from the least cohesive cluster. In the agglomerative scheme, each object starts in its own cluster and all end in one single cluster. This is accomplished by successive iterations that join the most similar clusters. The HC requires the specification of a linkage criterion for measuring the dissimilarity between clusters. Often, the average-linkage, d_av(R, S), is adopted [31], where R and S represent two clusters. Therefore, denoting by d(v_R, v_S) the distance between a pair of objects v_R ∈ R and v_S ∈ S, in the clusters R and S, respectively, we have:

d_av(R, S) = (1 / (|R| · |S|)) Σ_{v_R ∈ R} Σ_{v_S ∈ S} d(v_R, v_S).

The reliability of the clustering can be assessed by the cophenetic coefficient:

cc = Σ_{i<j} [d(v_i, v_j) − av(d)] [d̂(t_i, t_j) − av(d̂)] / √( Σ_{i<j} [d(v_i, v_j) − av(d)]² · Σ_{i<j} [d̂(t_i, t_j) − av(d̂)]² ),

where {v_i, v_j} and {t_i, t_j} stand for the original objects and their HC representations, respectively, av(·) denotes the average of the input argument, and d̂(t_i, t_j) represents the cophenetic distance between t_i and t_j. We always obtain 0 ≤ cc ≤ 1, with the limits corresponding to bad and good clustering, respectively. Additionally, the original and the cophenetic distances can be represented in a scatter plot denoted by Shepard diagram. A good clustering corresponds to points located close to a 45° line.
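The agglomerative procedure with average-linkage, together with the cophenetic check, can be sketched with SciPy's hierarchical clustering routines. This is a minimal illustration on synthetic data, not the Phylip/Matlab setup used later in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Toy stand-in for the N window vectors v_i: two well-separated groups
X = np.vstack([rng.normal(0.0, 0.1, (10, 5)),
               rng.normal(3.0, 0.1, (10, 5))])

D = pdist(X)                       # condensed matrix of pairwise distances
Z = linkage(D, method='average')   # agglomerative HC with average-linkage
cc, coph_D = cophenet(Z, D)        # cophenetic coefficient and distances
# cc close to 1 indicates that the tree preserves the original dissimilarities
```

Plotting `Z` with `scipy.cluster.hierarchy.dendrogram` yields the hierarchical tree, and a scatter of `D` against `coph_D` gives the Shepard diagram discussed above.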

The Multidimensional Scaling
The MDS includes a class of methods that represent high-dimensional data in a lower dimensional space, while preserving the inter-point distances as much as possible. The matrix D = [d(v_i, v_j)] feeds the MDS dimensionality reduction and visualization algorithm. The MDS tries to find the positions of Q-dimensional objects, t_i, with i = 1, ..., N, represented by points in space Q, so that Q ≤ P, while producing a matrix T = [d̂(t_i, t_j)] that approximates D. This is accomplished by means of an optimization procedure that tries to minimize a fitness function. Usually, the stress cost function, S, is adopted:

S = √( Σ_{i<j} [d(v_i, v_j) − d̂(t_i, t_j)]² / Σ_{i<j} d(v_i, v_j)² ).

The Sammon criterion is an alternative, yielding:

S_S = (1 / Σ_{i<j} d(v_i, v_j)) Σ_{i<j} [d(v_i, v_j) − d̂(t_i, t_j)]² / d(v_i, v_j).

The 'quality' of the MDS is assessed by comparing the original and the reproduced information. This can be accomplished by means of the Shepard diagram, which depicts d(v_i, v_j) versus d̂(t_i, t_j). Additionally, since the stress S decreases monotonically with the dimension Q, the user can establish a compromise between the two variables. Often, the practical choice is Q = 2 or Q = 3, since those values yield a direct graphical representation in the embedding space. Nevertheless, if the MDS locus is unclear, then the user must adopt another measure d(v_i, v_j) until a suitable representation is obtained.
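The distance-preserving idea can be illustrated with a minimal NumPy sketch using the classic (Torgerson) variant of MDS and the Kruskal stress, rather than the Matlab `mdscale` call with the Sammon criterion adopted later in the paper.

```python
import numpy as np

def classic_mds(D, Q=2):
    """Classic (Torgerson) MDS: embed N objects as points whose pairwise
    Euclidean distances approximate the input distance matrix D."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (D ** 2) @ J            # double-centered Gram matrix
    w, V = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:Q]          # keep the Q largest
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

def stress(D, T):
    """Kruskal stress between the original distances D and the
    distances reproduced by the embedding T."""
    Dt = np.sqrt(((T[:, None, :] - T[None, :, :]) ** 2).sum(-1))
    return float(np.sqrt(((D - Dt) ** 2).sum() / (D ** 2).sum()))
```

For a distance matrix generated by points that already live in a Q-dimensional Euclidean space, the embedding reproduces the distances and the stress is essentially zero; for real data, the residual stress quantifies the distortion introduced by the reduction.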
The t-Distributed Stochastic Neighbor Embedding

The algorithm comprises two main stages. In the first, for each pair of objects (v_i, v_j), i, j = 1, ..., N, the t-SNE constructs a joint probability distribution p_ij measuring the similarity between v_i and v_j, in such a way that similar (dissimilar) objects are assigned a higher (lower) probability:

p_{j|i} = exp(−‖v_i − v_j‖² / (2σ_i²)) / Σ_{k≠i} exp(−‖v_i − v_k‖² / (2σ_i²)),   p_ij = (p_{j|i} + p_{i|j}) / (2N),

where p_ij = p_ji, p_ii = 0, Σ_{i,j} p_ij = 1 and Σ_j p_{j|i} = 1, ∀i, j. The parameter σ_i² represents the variance of the Gaussian kernel that is centered on v_i. A particular value of σ_i induces a probability distribution, P_i, over all of the other datapoints. In other words, P_i represents the conditional probability distribution over all other datapoints given the datapoint v_i. The t-SNE searches for the value of σ_i that generates a distribution P_i with the value of perplexity specified by:

perplexity(P_i) = 2^{H(P_i)},

where H(P_i) is the Shannon entropy of P_i:

H(P_i) = −Σ_j p_{j|i} log₂ p_{j|i}.

As a result, the variance of the Gaussian kernel is adapted to the density of the data, meaning that smaller (larger) values of σ_i are used in denser (sparser) parts of the data space. The perplexity can be interpreted as a smooth measure of the effective number of neighbors of v_i. Typical values of perplexity(P_i) are in the interval [5, 50].
In the second stage, the t-SNE calculates the similarities between pairs of points in Q by means of a Student t-distribution with one degree of freedom:

q_ij = (1 + ‖t_i − t_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖t_k − t_l‖²)⁻¹,

where the symbol ‖·‖ denotes the 2-norm of the argument, q_ij = q_ji, q_ii = 0 and Σ_{i,j} q_ij = 1, ∀i, j. The t-SNE performs an optimization, while attempting to minimize the Kullback-Leibler (KL) divergence between the Gaussian distribution of the points in space P and the Student t-distribution of the points in the embedding space Q:

KL(P‖Q) = Σ_{i≠j} p_ij ln(p_ij / q_ij).

The minimization scheme starts with a given initial set of points in Q, and the algorithm uses the gradient descent with:

∂KL/∂t_i = 4 Σ_j (p_ij − q_ij)(t_i − t_j)(1 + ‖t_i − t_j‖²)⁻¹.

The KL divergence between the modeled input and output distributions is often used as a measure of the quality of the results.
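The search for the value of σ_i that matches a specified perplexity can be sketched as follows. This is a schematic NumPy fragment for a single datapoint, exploiting the fact that the perplexity grows monotonically with σ_i; it is not the full t-SNE optimization.

```python
import numpy as np

def cond_probs(sq_dists, sigma):
    """Conditional distribution P_i for one point, given sigma and the
    squared distances to all other datapoints."""
    p = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return p / p.sum()

def sigma_for_perplexity(sq_dists, target=5.0, iters=64):
    """Binary-search sigma_i so that 2^H(P_i) matches the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        sigma = 0.5 * (lo + hi)
        p = cond_probs(sq_dists, sigma)
        H = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy, in bits
        if 2.0 ** H > target:
            hi = sigma    # kernel too wide: entropy too high, shrink it
        else:
            lo = sigma
    return sigma
```

Running the search per datapoint reproduces the density-adaptive behavior described above: points in dense regions receive smaller σ_i than points in sparse regions.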

The Uniform Manifold Approximation and Projection
The UMAP is a recent technique [27] for clustering and visualizing high-dimensional datasets, which seeks to accurately represent both the local and global structures embedded in the data [38,39].
Given a distance, d(v_i, v_j), between pairs of objects v_i and v_j, i, j = 1, ..., N, and the number of neighbors to consider, k, the UMAP starts by computing the set of the k-nearest neighbors of v_i, N_i, with respect to d(v_i, v_j). Then, the algorithm calculates the parameters ρ_i and σ_i for each datapoint v_i. The parameter ρ_i represents the nonzero distance from v_i to its nearest neighbor and is given by:

ρ_i = min{d(v_i, v_j) : v_j ∈ N_i ∧ d(v_i, v_j) > 0}.

The parameter ρ_i is important to ensure the local connectivity of the manifold, since it yields a locally adaptive exponential kernel for each point.
The constant σ_i must satisfy the condition:

Σ_{v_j ∈ N_i} exp(−max(0, d(v_i, v_j) − ρ_i) / σ_i) = log₂(k),

determined using binary search. The algorithm constructs a joint probability distribution p_ij measuring the similarity between v_i and v_j, in such a way that similar (dissimilar) objects are assigned a higher (lower) probability:

p_ij = p_{j|i} + p_{i|j} − p_{j|i} · p_{i|j},

where

p_{j|i} = exp(−max(0, d(v_i, v_j) − ρ_i) / σ_i), if v_j ∈ N_i, and p_{j|i} = 0, otherwise.

In the second stage, the UMAP computes the similarities between each pair of points in the space Q:

q_ij = (1 + a‖t_i − t_j‖^{2b})⁻¹,

where q_ij = q_ji and q_ii = 0, ∀i, j. The constants a, b ∈ ℝ are either user-defined or are determined by the algorithm given the desired separation between close points, δ ∈ ℝ⁺, in the embedding space Q, by fitting (1 + a·d^{2b})⁻¹ to a curve that is approximately 1 for d ≤ δ and decays as exp(−(d − δ)) for d > δ. The UMAP performs an optimization while attempting to minimize the cross-entropy CE between the distributions of points in P and Q:

CE(P, Q) = Σ_{i≠j} [p_ij ln(p_ij / q_ij) + (1 − p_ij) ln((1 − p_ij) / (1 − q_ij))].

The minimization scheme starts with a given initial set of points in Q. The UMAP uses the graph Laplacian to assign the initial low-dimensional coordinates, and then proceeds with the optimization using gradient descent.
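The computation of ρ_i and the binary search for σ_i can be sketched for a single datapoint as follows. This is a schematic NumPy fragment; the full UMAP pipeline, including the fuzzy union p_ij = p_{j|i} + p_{i|j} − p_{j|i} p_{i|j} and the cross-entropy optimization, is considerably more involved.

```python
import numpy as np

def local_params(dists, k):
    """For one point: rho = nearest nonzero neighbor distance, and sigma
    binary-searched so that sum_j exp(-max(0, d_ij - rho)/sigma) = log2(k)
    over the k nearest neighbors."""
    knn = np.sort(dists)[:k]          # distances to the k nearest neighbors
    rho = knn[knn > 0][0]             # nearest nonzero distance
    target = np.log2(k)
    lo, hi = 1e-10, 1e10
    for _ in range(64):
        sigma = 0.5 * (lo + hi)
        s = np.sum(np.exp(-np.maximum(knn - rho, 0.0) / sigma))
        if s > target:
            hi = sigma                # kernel too wide: shrink it
        else:
            lo = sigma
    return rho, sigma
```

As with the t-SNE perplexity search, the sum grows monotonically with σ_i, so the bisection converges; the subtraction of ρ_i guarantees that each point is fully connected to its nearest neighbor.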

Description of the Dataset
The prototype dataset representative of a given CS corresponds to the DJIA daily closing values from 28 December 1959 up to 12 March 2021. Each week includes 5 working days. Occasional missing data are obtained by means of linear interpolation. The resulting time-series x = {x_k : k = 1, ..., L} comprises L = 15970 values, x_k, covering approximately six decades.
Often, we pre-process x in order to reduce the sensitivity to the high variation in the numerical values, yielding x̃ = {Φ_q(x_k) : k = 1, ..., L}. Commonly adopted functions Φ_q(·) are the logarithm of the values, the logarithm of the returns, and the normalization by the arithmetic mean, av(x), and the standard deviation, σ(x), given by:

Φ_1(x_k) = log(x_k),   Φ_2(x_k) = log(x_k / x_{k−1}),   Φ_3(x_k) = (x_k − av(x)) / σ(x).

Figure 1 depicts the evolution of x, as well as the logarithm of the returns x̃ = {Φ_2(x_k) : k = 1, ..., L}, which reveals a fractal nature. We verify the existence of 13 main periods, denoted from A to M. For k ∈ [1, 640], corresponding to the periods A and B, the values of x_k are small, starting with a decrease, followed by a recovering trend. This behavior is followed by a sustained increase in the DJIA during k ∈ [640, 1555], period C. The interval k ∈ [1555, 5890] corresponds to the periods D, E and F, which are characterized by overall stagnation in between severe crises. For k ∈ [5890, 7237], that is, period G, we have an important rising trend, interrupted abruptly, but rapidly recovered, marking the beginning of period H for k ∈ [7237, 10340]. For k ∈ [10340, 11240], corresponding to period I, the DJIA reveals a decreasing trend. This behavior is followed by the period J, during the interval k ∈ [11240, 12500], characterized by a sustained increase in the DJIA values. For k ∈ [12500, 12840], the period K reveals a strong falling trend. Then, recovery initiates and a rising trend is verified during the period L, that is, for k ∈ [12840, 15690]. This period is interrupted suddenly, but rapidly recovered, signaling the beginning of M, corresponding to k ∈ [15690, 15970]. Table 1 summarizes the periods A to M.

To assess the dynamics of the DJIA, the time-series x̃ is segmented into N = ⌊(L − W) / ((1 − α)W)⌋ + 1 windows, where W is the window length, α ∈ [0, 1) stands for the window overlapping factor, and ⌊·⌋ denotes the floor function. Therefore, the ith, i = 1, ..., N, window consists of the vector v_i = (x̃_{(i−1)(1−α)W+1}, ..., x̃_{(i−1)(1−α)W+W}).

Figure 2 portrays the histogram of x̃ = {Φ_2(x_k) : k = 1, ..., L} for consecutive disjoint windows (α = 0) and W = 60. We verify the existence of fat tails in the statistical distribution, as well as a 'noisy' behavior, which are also verified for other functions Φ_q and values of α and W.
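The normalization and segmentation procedure can be sketched as follows, here using the Φ_3 standardization; with α = 0 and W = 60 the series of length L = 15970 yields the N = 266 disjoint windows used later in the paper. The synthetic ramp series stands in for the DJIA data.

```python
import numpy as np

def segment(x, W, alpha):
    """Normalize the series (Phi_3 standardization) and split it into
    windows of length W with overlap factor alpha (alpha = 0 gives
    consecutive disjoint windows)."""
    xt = (x - x.mean()) / x.std()          # Phi_3: zero mean, unit variance
    step = int(round((1 - alpha) * W))     # shift between window starts
    starts = range(0, len(xt) - W + 1, step)
    return np.array([xt[s:s + W] for s in starts])

# Synthetic stand-in for the DJIA series of length L = 15970
V = segment(np.arange(15970, dtype=float), W=60, alpha=0.0)
# V.shape[0] is the number of objects N; here N = 266, as in the paper
```

Each row of `V` is one object v_i, ready to be fed to the distance computations and to the HC, MDS, t-SNE and UMAP algorithms.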

Analysis and Visualization of the DJIA
The DJIA time-series x is normalized using expression (34), yielding x̃ = {Φ_3(x_k) : k = 1, ..., L}. Naturally, other types of pre-processing are possible, but the linear transform (34) is common in signal processing [40], and several experiments showed that it yields good results.
In the next subsections, x̃ is segmented using consecutive disjoint (α = 0) time windows of length W = 60 days, which yield N = 266 objects, v_i, with i = 1, ..., N. These objects are processed by the dimensionality reduction and visualization methods, while adopting the different distances (1)-(9) to quantify the dissimilarities between objects. For the generalized distance d_10, given by expression (10), since no a priori preference for a given formula is set, we adopt identical weights, that is, λ_r = 1/9, r = 1, ..., 9. The values of α and W were chosen experimentally. Obviously, other values could have been adopted, but those used lead to a good compromise between time resolution and suitable visualization.
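The equal-weight combination adopted for d_10 can be illustrated schematically. Since expression (10) is not reproduced here, this sketch merely assumes that d_10 is a weighted sum of the individual distances d_1, ..., d_9, with weights λ_r.

```python
import numpy as np

def combined_distance(u, v, dists, weights=None):
    """Weighted combination of several distance functions; the default
    equal weights lambda_r = 1/len(dists) mirror the choice
    lambda_r = 1/9 used for d_10 in the paper."""
    if weights is None:
        weights = np.full(len(dists), 1.0 / len(dists))
    return float(sum(w * d(u, v) for w, d in zip(weights, dists)))
```

In practice, the component distances would typically be rescaled to comparable ranges before combining, so that no single distance dominates the sum.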

The HC Analysis and Visualization of the DJIA
The neighbor-joining method [41] and the successive (agglomerative) clustering using average-linkage are adopted, as implemented by the software Phylip [42] with the option neighbor. Figure 3 depicts the HC trees with the distances d_2, d_3, d_5 and d_10. The circular marks correspond to objects (window vectors) and the colormap represents the arrow of time. We verify that the HC has difficulty in separating the periods A-F and, for distance d_5, this difficulty is also observed for the periods H-J. For other distances, we obtain loci of the same type. The HC loci reflect the relationships between objects, but the interpretation of such loci is difficult due to the presence of many objects and because we are constrained to 2-dim visual representations. The reliability of the clustering, that is, how well the hierarchical trees reproduce the dissimilarities of the original objects in the dataset, was verified. Nevertheless, we do not include the Shepard diagrams for the sake of parsimony.

The MDS Analysis and Visualization of the DJIA
We now visualize the DJIA behavior using the MDS. The Matlab function mdscale with the Sammon nonlinear mapping criterion is adopted. Figure 4 depicts the 3-dim loci obtained for α = 0 and W = 60 (N = 266) with the distances d 2 , d 3 , d 5 and d 10 .
The reliability of the 3-dim loci was verified through the standard Shepard and stress plots, which showed that the objects in the embedding space Q reproduce those in the original space P. Those diagrams are not depicted for the sake of parsimony. We verify that the MDS unravels patterns compatible with the DJIA 13 periods A-M. However, the algorithm cannot discriminate between them. The patterns are composed of two 'segments' formed by objects that reveal an almost continuous and smooth evolution in time. Each segment translates into a DJIA dynamics exhibiting strong memory effects that are captured by the visualization technique with the adopted distance. The transition between segments corresponds to some discontinuity where the memory of past values is somehow lost. For other distances, we obtain loci of several types. However, it should be noted that often the definition of an adequate distance (in the sense of assessing the dynamical effects) necessitates some numerical trials. Different distances can lead to valid visual representations, but may be unable to capture the features of interest. For example, the correlation distance, d_11, given by:

d_11(v_i, v_j) = 1 − Σ_k (v_{ik} − v̄_i)(v_{jk} − v̄_j) / √( Σ_k (v_{ik} − v̄_i)² · Σ_k (v_{jk} − v̄_j)² ),

where v̄_i denotes the mean of the components of v_i, leads to the loci shown in Figure 5, revealing that neither the HC nor the MDS can capture the memory effects embedded in the dataset.

The t-SNE Analysis and Visualization of the DJIA
The Matlab function tsne was adopted to visualize the dataset x̃ = {Φ_3(x_k) : k = 1, ..., L}. The algorithm was set to exact and the value 5 was given to both the Exaggeration and the Perplexity. These values were adjusted by trial and error in order to obtain a good visualization. The Exaggeration controls the size of the natural clusters in the data: a large exaggeration creates relatively more space between clusters in the embedding space Q. The Perplexity is related to the number of local neighbors of each point. All other parameters kept their default values. Figure 6 depicts the 3-dim loci obtained for the distances d_2, d_3, d_5 and d_10. The loci reveal that the t-SNE can arrange objects according to the periods A-M and that the plots generated with the different distances are similar.

The UMAP Analysis and Visualization of the DJIA
For implementing the UMAP dimensionality reduction and visualization, we adopted the Matlab UMAP code, version 2.1.3, developed by Stephen Meehan et al. [43]. The function run_umap was used with parameters n_neighbors and min_dist set to 5 and 0.2, respectively, adjusted by trial and error in order to obtain good visualization. These parameters correspond directly to k and δ introduced in Section 2.2.4. All other parameters are set to their default values. Figure 7 depicts the 3-dim loci obtained for the distances d 2 , d 3 , d 5 and d 10 . The UMAP can organize objects in Q according to their characteristics, identifying well the periods A-M, independently of the adopted distance. Therefore, we conclude that both the t-SNE and the UMAP perform better than the MDS in representing the DJIA dynamics. The visualization has only slight variations with the distance adopted to compare objects.

Assessing the Effect of W and α in the Visualization of the DJIA Dynamics
The window width and overlap, W and α, represent a compromise between time resolution and memory length. In this section, we study the effect of these parameters on the patterns generated by the HC, MDS, t-SNE and UMAP. The analysis was performed for all distances and several combinations of W and α. The results are presented for the Canberra distance, d_2, and the cases summarized in Table 2, where W = {90, 60, 30, 10} and α = {0, 0.2, 0.5}. For other distances, we obtain similar conclusions. Regarding the HC, we verify that the loci are quite insensitive to the parameter W, with the exception of W = 10. For this value of window length, the HC can discriminate objects in the periods A-F, although that capability depends on the overlap α. For W = 10 and α = 0.5, the objects in A-F spread out in space, but their clusters are still unclear. Concerning the MDS, besides the density of objects, which, naturally, varies with N, the 3-dim loci are almost invariant with respect to the parameters W and α. The t-SNE and UMAP reveal a superior ability to generate patterns that correspond to dissimilarities between objects and, therefore, are able to identify the 13 periods A-M. However, for the t-SNE, this ability is weakened as the number of objects, N, increases, that is, for small values of W and high values of α. For such cases, the generated loci are difficult to interpret. The UMAP reveals the 13 periods A-M for all combinations of W and α. Moreover, for small values of W, several sub-periods are unraveled, which relate directly to the time evolution of the DJIA.

Figure 11. The 3-dim loci obtained for d_2 and W = 10: (a) HC and α = 0 (E_10); (b) HC and α = 0.2 (E_11); (c) HC and α = 0.5 (E_12); (d) MDS and α = 0 (E_10); (e) MDS and α = 0.2 (E_11); (f) MDS and α = 0.5 (E_12); (g) t-SNE and α = 0 (E_10); (h) t-SNE and α = 0.2 (E_11); (i) t-SNE and α = 0.5 (E_12); (j) UMAP and α = 0 (E_10); (k) UMAP and α = 0.2 (E_11); (l) UMAP and α = 0.5 (E_12).

Conclusions
This paper explored an alternative strategy to the classical time-domain analysis in the study of multidimensional data generated by CS. The DJIA index of daily closing values from 28 December 1959 up to 12 March 2021 was adopted for the numerical experiments. In the proposed scheme, the original time-series was normalized and segmented, yielding a number of objects. These objects are vectors, whose dimension and overlap represent a compromise between time resolution and memory length. The objects were compared using various distances and their dissimilarities were used as the input to four dimensionality reduction and information visualization algorithms, namely, the HC, MDS, t-SNE and UMAP. These algorithms construct representations of the original dataset, where time is a parametric variable, with no a priori requirements. The algorithms are based on the minimization of the difference between the original and the approximated data. The plots were analyzed in terms of the emerging patterns. Those graphical representations are composed of a number of 'segments', formed by objects with an almost continuous evolution in time, interlaid, eventually, by some discontinuities. This translates into a DJIA dynamics that exhibits phases with visible correlation and, consequently, memory effects, as well as transitions corresponding to discontinuities where the memory of past values is lost. Numerical experiments illustrated the feasibility and effectiveness of the method for processing complex data. The approach can be easily extended to deal with more features and richer descriptions of the data involving a higher number of dimensions.