Article

A Goodness-of-Fit Framework for Assessing Distributional Symmetry and Tail Asymmetry in Financial Equity Markets

by Abdullah Sevin 1,* and Alpha Abdoulaye Bah 2
1 Department of Computer Engineering, Sakarya University, 54050 Sakarya, Türkiye
2 Department of Data Science and Artificial Intelligence, Sakarya University, 54050 Sakarya, Türkiye
* Author to whom correspondence should be addressed.
Symmetry 2026, 18(6), 943; https://doi.org/10.3390/sym18060943
Submission received: 23 April 2026 / Revised: 12 May 2026 / Accepted: 27 May 2026 / Published: 30 May 2026

Abstract

The assumption that highly correlated financial assets share identical risk profiles often overlooks crucial distributional asymmetries. This study introduces a Goodness-of-Fit (GoF) framework to evaluate stochastic symmetry and structural alignment of equity returns. Moving beyond linear correlation, we apply non-parametric GoF tests—Kolmogorov–Smirnov, permutation-based Anderson–Darling, and Epps–Singleton—complemented by Energy Distance metrics, Extreme Value Theory (EVT) for 1% and 5% tail asymptotics, and robust L-moments to quantify tail asymmetry. We analyze major stocks against market indices and sectoral ETFs using ARMA-GARCH filtered innovations to isolate IID components. Our findings reveal a significant decoupling between correlation and stochastic symmetry; highly correlated assets frequently exhibit tail asymmetry and structural drift. Energy Distance decomposition isolates shape-driven deviations from scale-driven volatility. Furthermore, hierarchical clustering categorizes assets into distinct risk profiles, bridging structural divergence and left-tail risk. A 1000-iteration bootstrapped backtest shows that integrating our GoF framework with tail-risk penalties improves risk-adjusted performance, evidenced by superior Sharpe ratios (outperforming 80.3% of random allocations). In conclusion, high linear correlation does not guarantee distributional symmetry. The proposed framework offers deeper insights into asymmetric asset behavior than conventional second moment metrics, providing a robust tool for portfolio risk management under non-Gaussian market conditions.

1. Introduction

Goodness-of-Fit (GoF) testing represents a fundamental pillar of statistical inference, providing rigorous methodologies to assess whether observed data align with hypothesized probability distributions [1]. In applied disciplines—particularly economics, finance, and risk management—distributional assumptions and stochastic symmetry underpin critical decisions, from portfolio construction to regulatory capital calculations. The validity of these assumptions, especially regarding distributional symmetry, directly impacts model performance, risk assessment accuracy, and ultimately, economic outcomes. Consequently, selecting appropriate GoF tests to identify potential asymmetric deviations and interpreting their results correctly is essential for robust empirical analysis.
The statistical literature offers a diverse array of GoF methodologies, each with distinct theoretical foundations and sensitivity profiles. Classical approaches include frequency-based tests such as the Chi-square statistic, empirical distribution function (EDF) tests like Kolmogorov–Smirnov and Anderson–Darling, moment-based diagnostics including Jarque–Bera and D’Agostino–Pearson, and characteristic function-based methods such as the Epps–Singleton test. As emphasized by [2], no single test is uniformly most powerful against all alternatives; rather, each excels at detecting specific types of departures—whether in location, scale, skewness, kurtosis, or extreme tail asymmetry. This complementarity, however, does not extend to capturing the joint dependence structure between multiple assets, which is essential for portfolio risk assessment.
A separate but equally important strand of the literature addresses this gap through copula theory, which has become the standard framework for modeling multivariate tail dependence in finance [3,4]. Unlike univariate GoF tests, copulas allow the separate specification of marginal distributions and the dependence structure, making them particularly suited for analyzing joint extreme events such as simultaneous market crashes. Recent applications in financial risk management demonstrate that vine copulas can capture complex asymmetric dependencies across multiple assets [5], while time-varying copulas extend this framework to dynamic settings [6]. While copula-based methods are essential for modeling multivariate tail dependence, the present study focuses on univariate distributional similarity and tail asymmetry.
With this foundation, the primary motivation for this study stems from the recurrent failures of traditional, correlation-based diversification strategies during periods of market stress. When financial markets experience severe drawdowns, linear correlations often converge toward one, rendering classical mean-variance frameworks highly vulnerable. This vulnerability exposes a critical oversight in modern finance: the assumption that assets moving together linearly (high correlation) share identical risk distributions. In reality, investors need to understand not just average co-movements but whether extreme downside risks (tails) and structural distributions behave symmetrically. Consequently, in financial applications, GoF testing assumes particular importance due to the well-documented non-Gaussian characteristics of asset returns. Empirical evidence consistently demonstrates that financial returns exhibit structural asymmetry, heavy tails, and volatility clustering—features that challenge traditional Gaussian-based models. To rigorously test these features and prevent spurious rejections, it is imperative to isolate independent and identically distributed (IID) innovations, typically via ARMA-GARCH filtering, before applying distributional diagnostics. Moreover, the widespread use of benchmark-oriented investing and sector-based portfolio construction relies implicitly on distributional assumptions that are rarely tested statistically. Validating whether individual stocks or sector ETFs exhibit stochastic symmetry and return distributions consistent with broader market benchmarks is crucial for accurate risk attribution, effective diversification, and reliable performance evaluation under increasingly complex market conditions.
The dynamic nature of financial markets necessitates a shift beyond traditional linear approaches in the analysis of asset returns. Ref. [7] emphasizes that inaccuracies in financial risk estimations are fundamentally rooted in the misidentification of the distributional properties of data, highlighting the critical role of data characterization prior to modeling. Similarly, ref. [8] argues that traditional linear models fail to adequately capture asymmetric dependencies and heavy-tailed structures prevalent among financial assets. These stylized facts of financial return series—most notably excess kurtosis and volatility clustering [9]—remain the primary source of parametric model inadequacies, a concern recently revitalized by [10]. Furthermore, ref. [11] demonstrates that traditional risk measures often fail to capture the complexity of asset interdependencies, necessitating more robust distributional testing.
Driven by these challenges, this paper seeks to address three primary research questions: (1) To what extent do highly correlated financial assets diverge in their underlying distributional symmetries and left-tail risk profiles? (2) How can practitioners systematically isolate and measure these non-linear structural deviations while bypassing the distorting effects of volatility clustering? and (3) Does the integration of these advanced non-parametric distributional metrics into portfolio construction yield economically significant improvements in risk-adjusted returns compared to traditional random allocation?
To this end, we propose a Goodness-of-Fit (GoF) diagnostic framework that evaluates distributional similarity and stochastic symmetry across multiple critical dimensions. First, the paper provides a structured review and comparison of major GoF test families, clarifying their theoretical foundations, implementation requirements, and diagnostic strengths in detecting asymmetric deviations within a unified mathematical framework. Second, it demonstrates the practical application of these tests through a multi-stage empirical analysis: (1) assessing distributional symmetry between individual equities and market benchmarks using ARMA-GARCH filtered IID innovations; (2) evaluating sectoral stochastic alignment via Energy Distance metrics on the same IID-filtered residuals; (3) capturing extreme tail asymmetry by estimating 1% and 5% critical tail asymptotics via Extreme Value Theory (EVT) alongside robust L-moments; (4) synthesizing these structural and tail metrics via hierarchical clustering to categorize assets into distinct risk profiles; and (5) demonstrating the economic significance of the framework through a bootstrapped portfolio backtest, which shows that integrating GoF metrics with tail-risk penalties reduces Expected Shortfall (ES) and improves risk-adjusted returns. By integrating multiple complementary tests, including Kolmogorov–Smirnov, Cramér–von Mises, Epps–Singleton, and Energy Distance metrics, we offer a multidimensional diagnostic approach that transcends the limitations of single-test methodologies in capturing non-Gaussian market dynamics.
The remainder of this paper is organized as follows. Section 2 reviews the theoretical foundations of the employed GoF tests. Section 3 presents the comprehensive empirical framework and results. This section is subdivided to systematically cover data description and IID filtering, benchmark and sectoral distributional alignment, tail risk dynamics, risk profiling via hierarchical clustering, and ultimately, the economic significance of the framework through bootstrapped portfolio backtesting. Finally, Section 4 concludes with methodological reflections and suggestions for future research.

2. Goodness-of-Fit Test Families

Goodness-of-Fit (GoF) testing is used to determine whether a set of observed data is consistent with a specified probabilistic model. Let $X_1, \dots, X_n$ represent an independent and identically distributed sample with cumulative distribution function (CDF) $F$. A GoF procedure evaluates the following hypotheses:
$$H_0 : F = F_0 \quad \text{versus} \quad H_1 : F \neq F_0,$$
where $F_0$ denotes a specified theoretical distribution. The null hypothesis is said to be simple when all parameters of $F_0$ are known, and composite when one or more parameters must be estimated from the data. Unlike traditional hypothesis testing, Goodness-of-Fit analysis frequently aims for the non-rejection of $H_0$ to demonstrate the adequacy of a given distributional assumption [2].
In finance and econometrics, Goodness-of-Fit tests are essential for validating the distributional assumptions that underlie asset pricing, risk measurement, and portfolio optimization. According to the taxonomy proposed by [2], Goodness-of-Fit tests are categorized into several distinct families. These tests can also be classified based on their underlying statistical principles.

2.1. Frequency-Based Tests

Frequency-based tests evaluate the correspondence between observed and expected frequencies. These tests are applied across a finite set of categories.

Chi-Square Goodness-of-Fit Test

The Chi-square Goodness-of-Fit test is appropriate when data are represented as frequency counts across a finite set of categories. Each category must be mutually exclusive, and every observation must be assigned to a single category. The observations are assumed to be independent, and the null hypothesis defines a probabilistic model that ensures the validity of the test statistic’s asymptotic properties. Expected frequencies should not be too small [2,12]; a common rule of thumb is an expected count of at least five in each category.
Suppose the data are partitioned into $k$ distinct categories. Let $O_i$ denote the observed frequency in category $i$, and $E_i$ represent the corresponding expected frequency under the null hypothesis. The Chi-square Goodness-of-Fit statistic is calculated as follows:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}.$$
This statistic quantifies the cumulative discrepancy between observed and theoretical frequencies. It does so by summing the standardized squared deviations across all categories. When the null hypothesis is true and the model assumptions are satisfied, the sampling distribution of the test statistic approaches a Chi-square distribution as the sample size increases. The corresponding number of degrees of freedom is given by
$$\nu = k - 1 - p,$$
where $p$ denotes the number of parameters of the theoretical model that are estimated from the observed data [2,13].
The magnitude of the Chi-square statistic represents the degree of discrepancy between the observed frequency distribution and the hypothesized distribution. Higher values provide stronger evidence against the null hypothesis, whereas lower values suggest that observed deviations may reasonably be attributed to random variation. A formal decision rule involves comparing the observed statistic to a critical value derived from the Chi-square distribution at a predetermined significance level [13].
The Chi-square Goodness-of-Fit test is widely employed because of its methodological simplicity, computational efficiency, and ease of interpretation. This test is particularly suitable for discrete data or for continuous variables that have been grouped into categories prior to analysis. Consequently, the test is frequently applied in empirical research to assess conformity to a reference distribution using aggregated frequency information [2,14].
Although widely used, the Chi-square test possesses several significant limitations. The results of the Chi-square test are highly dependent on the selected categorization scheme, which can obscure important features of the underlying distribution. Furthermore, when expected frequencies are low, the asymptotic Chi-square approximation becomes unreliable. In comparison to Goodness-of-Fit procedures that utilize the empirical distribution function, the Chi-square test frequently demonstrates lower sensitivity to localized or tail-specific deviations from the null model [15].
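The statistic, degrees of freedom, and decision rule described in this subsection can be sketched in a few lines. The following Python example is illustrative only: the function name `chi_square_gof`, the argument `n_estimated_params` (standing in for $p$), and the frequency counts are assumptions made for the example, not values from the study.

```python
import numpy as np
from scipy import stats


def chi_square_gof(observed, expected, n_estimated_params=0):
    """Chi-square GoF statistic, degrees of freedom, and p-value.

    `n_estimated_params` plays the role of p in nu = k - 1 - p.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    # Sum of standardized squared deviations across the k categories.
    statistic = np.sum((observed - expected) ** 2 / expected)
    dof = observed.size - 1 - n_estimated_params
    # Upper-tail probability under the asymptotic chi-square distribution.
    p_value = stats.chi2.sf(statistic, dof)
    return statistic, dof, p_value


# Hypothetical counts for five equally likely categories (100 observations).
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]
stat, dof, p = chi_square_gof(observed, expected)
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p:.3f}")
```

When no parameters are estimated, the statistic should agree with `scipy.stats.chisquare(observed, expected)`; the explicit version is shown only to make the degrees-of-freedom adjustment visible.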

2.2. Empirical Distribution Function (EDF) Based Tests

EDF-based tests compare the empirical distribution function $F_n(x)$ with a hypothesized cumulative distribution function $F_0(x)$ [2]; they are summarized in Table 1.

2.2.1. Kolmogorov–Smirnov Test

The Kolmogorov–Smirnov test is appropriate for data comprising independent and identically distributed observations that are drawn from a continuous distribution. The null hypothesis defines a specific reference cumulative distribution function $F_0(x)$, which is assumed to be completely specified. The test is based on the empirical distribution function constructed from the sample. It assumes that ties are absent or negligible since their presence may influence the theoretical distribution of the test statistic [2,15].
Consider a random sample $X_1, \dots, X_n$ with the associated empirical distribution function $F_n(x)$. The function $F_0(x)$ denotes the cumulative distribution function specified under the null hypothesis. The Kolmogorov–Smirnov test statistic is defined as follows:
$$D_n = \sup_x \left| F_n(x) - F_0(x) \right|,$$
which represents the maximum absolute deviation between the empirical and theoretical distribution functions over the domain of the data [16,17]. Assuming the null hypothesis and a continuous reference distribution, the sampling distribution of $D_n$ does not depend on $F_0(x)$ and converges to the Kolmogorov distribution as the sample size increases [15,17]. This distribution-free property holds only when the parameters of the reference distribution are known a priori.
The magnitude of the Kolmogorov–Smirnov (KS) statistic quantifies the maximum deviation between the observed data and the hypothesized theoretical model. Large values of $D_n$ provide evidence against the null hypothesis, while small values indicate stochastic consistency between the sample and the reference distribution. A decision rule is established by comparing the observed statistic to a critical value derived from the Kolmogorov distribution at a specified significance level [15].
The Kolmogorov–Smirnov test is widely utilized due to its independence from data binning requirements, as it operates directly on the empirical distribution function. Its non-parametric nature and distribution-free properties under a fully specified null hypothesis make the test particularly valuable for exploratory data analysis and model validation. Consequently, the test is frequently applied in statistics, finance, and engineering contexts where minimal distributional assumptions are required [2,14].
Although the Kolmogorov–Smirnov test offers several advantages, it is subject to notable limitations within financial diagnostics. Crucially, the test demonstrates reduced sensitivity to discrepancies that occur in the tails of the distribution, being most responsive to deviations that arise near the center. Additionally, when the parameters of the reference distribution are estimated directly from the data, the null distribution of the test statistic is altered, rendering standard critical values invalid. Finally, the Kolmogorov–Smirnov test is less suitable for discrete distributions, where ties can violate its underlying mathematical assumptions [15].
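As a rough illustration of how $D_n$ is obtained in practice, the sketch below evaluates the supremum at the jump points of the empirical distribution function and compares the result with `scipy.stats.kstest`. The simulated standard normal sample and the helper name `ks_statistic` are assumptions made for the example; the reference distribution is treated as fully specified.

```python
import numpy as np
from scipy import stats


def ks_statistic(sample, cdf):
    """One-sample KS statistic D_n against a fully specified CDF."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    f0 = cdf(x)
    # The EDF jumps at each order statistic, so the supremum is attained
    # just after (i/n) or just before ((i-1)/n) one of the jumps.
    d_plus = np.max(np.arange(1, n + 1) / n - f0)
    d_minus = np.max(f0 - np.arange(0, n) / n)
    return max(d_plus, d_minus)


rng = np.random.default_rng(0)
sample = rng.standard_normal(500)  # simulated data, not the study's returns

d_manual = ks_statistic(sample, stats.norm.cdf)
d_scipy, p_value = stats.kstest(sample, "norm")
print(f"D_n (manual) = {d_manual:.4f}, D_n (scipy) = {d_scipy:.4f}, p = {p_value:.3f}")
```

If the mean and variance were estimated from the same sample, the p-value reported here would no longer be valid, in line with the limitation noted above.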

2.2.2. Cramér–Von Mises Test

The Cramér–von Mises test is intended for use with continuous random variables. It is based on a comparison between the empirical distribution function and a specified theoretical cumulative distribution function. Unlike frequency-based tests that require binning, the Cramér–von Mises test analyzes ungrouped data. This approach avoids the information loss that can result from categorizing continuous data. The null hypothesis asserts that the empirical distribution matches the theoretical model under consideration [18,19].
Consider an independent sample $X_1, \dots, X_n$ with the associated empirical distribution function $F_n(x)$, where $F(x)$ represents the fully specified cumulative distribution function under the null hypothesis. The Cramér–von Mises test statistic is defined as follows [18,19]:
$$\omega^2 = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 \, dF(x),$$
which measures the integrated squared deviation between the empirical and theoretical distribution functions over the entire data support. The value of the Cramér–von Mises statistic quantifies the overall discrepancy between the empirical distribution and the hypothesized model. Higher values indicate more substantial departures from the null hypothesis, while lower values suggest closer stochastic agreement between the empirical and theoretical distributions. Statistical significance is determined by comparing the observed statistic with critical values obtained from its asymptotic distribution or from tabulated approximations [15].
The Cramér–von Mises test is recognized for its sensitivity to deviations throughout the entire distribution. Unlike maximum-distance tests (such as Kolmogorov–Smirnov) that emphasize extreme differences at a single point, it evaluates squared discrepancies across the full range of the distribution. By utilizing the empirical distribution function, the test avoids arbitrary binning choices and provides a more balanced assessment of Goodness-of-Fit. These characteristics make the Cramér–von Mises test particularly suitable for financial applications in which moderate yet systematic structural deviations from the theoretical model are of interest [2].
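For ordered data, the integral defining $\omega^2$ reduces to the standard computational form $\omega^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left[ F(X_{(i)}) - \frac{2i-1}{2n} \right]^2$. The sketch below uses that form and checks it against `scipy.stats.cramervonmises`; the simulated sample and the helper name `cvm_statistic` are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def cvm_statistic(sample, cdf):
    """Cramér-von Mises statistic against a fully specified CDF."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    # Computational form of n * integral of (F_n - F)^2 dF.
    return 1.0 / (12 * n) + np.sum((cdf(x) - (2 * i - 1) / (2 * n)) ** 2)


rng = np.random.default_rng(1)
sample = rng.standard_normal(500)  # simulated data for illustration

w2_manual = cvm_statistic(sample, stats.norm.cdf)
result = stats.cramervonmises(sample, "norm")
print(f"omega^2 (manual) = {w2_manual:.4f}, omega^2 (scipy) = {result.statistic:.4f}, "
      f"p = {result.pvalue:.3f}")
```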

2.2.3. Anderson–Darling Test

The Anderson–Darling test is designed for application to continuous random variables. It is based on comparing the empirical distribution function with a fully specified theoretical cumulative distribution function. The test assumes that observations are independent and identically distributed. While the null hypothesis traditionally stipulates that the form of the theoretical distribution must be fully specified a priori, some extensions permit parameter estimation. In contrast to frequency-based methods, the Anderson–Darling test operates on ungrouped data and places greater emphasis on discrepancies that occur in the tails of the distribution [2,20].
Let $X_1, \dots, X_n$ denote an independent sample with ordered observations $X_{(1)} \le \dots \le X_{(n)}$, and let $F(x)$ denote the cumulative distribution function under the null hypothesis. The Anderson–Darling test statistic is defined as
$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \ln F(X_{(i)}) + \ln\left( 1 - F(X_{(n+1-i)}) \right) \right],$$
where the logarithmic weighting mathematically amplifies discrepancies between the empirical and theoretical distributions in the extreme tail regions.
The Anderson–Darling statistic measures the overall stochastic deviation of the sample distribution from the hypothesized model. It is particularly sensitive to tail asymmetry and heavy-tailed behaviors. Higher values of the Anderson–Darling statistic provide stronger evidence against the null hypothesis. Statistical significance is determined by comparing the observed statistic to critical values, which are derived from the asymptotic distribution or from tabulated approximations specific to the assumed theoretical distribution [15].
The Anderson–Darling test is widely utilized due to its superior statistical power for detecting deviations in the tails of a distribution, where many traditional Goodness-of-Fit tests (such as Kolmogorov–Smirnov) demonstrate limited sensitivity. This characteristic renders the Anderson–Darling test particularly valuable in financial diagnostics, where extreme values and non-Gaussian tail behaviors are critical for risk analysis and return modeling. Compared to other empirical distribution function-based tests, it provides a more discriminating assessment when tail accuracy and structural symmetry are of primary concern [2].
While the Anderson–Darling test is recognized for its effectiveness in evaluating tail behavior, several practical limitations exist. The test traditionally requires a fully specified continuous distribution, and estimating parameters directly from the sample invalidates standard critical values unless appropriate adjustments or resampling techniques (e.g., permutation-based approaches) are implemented. Published critical values are available for only a limited set of reference distributions, often necessitating custom simulations. Additionally, the test’s aggressive tail weighting may exaggerate minor discrepancies in very large samples. Furthermore, it is not well-suited for discrete data, as the presence of ties violates its underlying mathematical assumptions [15].
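The $A^2$ formula can be evaluated directly once the sample has been mapped through the hypothesized CDF. The following sketch assumes a fully specified standard normal null and simulated data; the helper name `anderson_darling_stat` is hypothetical. It computes only the statistic, since critical values depend on the reference distribution and, as noted above, may require simulation or resampling.

```python
import numpy as np
from scipy import stats


def anderson_darling_stat(u):
    """A^2 computed from u_i = F(X_i), the sample mapped through the null CDF."""
    u = np.sort(np.asarray(u, dtype=float))
    n = u.size
    i = np.arange(1, n + 1)
    # u[::-1] holds u_(n), ..., u_(1), so its (i-1)-th entry is u_(n+1-i).
    s = np.sum((2 * i - 1) * (np.log(u) + np.log1p(-u[::-1])))
    return -n - s / n


rng = np.random.default_rng(2)
sample = rng.standard_normal(500)  # simulated data for illustration
a2 = anderson_darling_stat(stats.norm.cdf(sample))
print(f"A^2 = {a2:.4f}")

# Hand-checkable case: a single observation at the median gives 2*ln(2) - 1.
a2_single = anderson_darling_stat([0.5])
```

Note that `scipy.stats.anderson` estimates the distribution parameters from the data, so its output is not directly comparable with this fully specified version.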

2.3. Moment-Based Tests

Moment-based Goodness-of-Fit tests assess the adequacy of a distribution by comparing sample moments with the values expected under the hypothesized distribution. These methods (Table 2) are primarily designed to detect departures from normality by evaluating skewness and kurtosis [2].

2.3.1. Jarque–Bera Test

The Jarque–Bera test assumes that the observations constitute an independent and identically distributed sample. The null hypothesis specifies a normal distribution, in which skewness is zero and kurtosis equals three. The test relies on large-sample approximations and is consequently most reliable for moderate to large sample sizes [21].
Let $S$ denote the sample skewness, $K$ the sample kurtosis, and $n$ the sample size. The Jarque–Bera test statistic is defined as [21]
$$JB = \frac{n}{6} \left[ S^2 + \frac{(K - 3)^2}{4} \right],$$
which combines squared deviations of skewness and kurtosis from their normal theory values. Large values of the Jarque–Bera statistic suggest the presence of significant asymmetry, excess kurtosis, or a combination of both characteristics. Under the null hypothesis of normality, the statistic follows asymptotically a Chi-square distribution with two degrees of freedom.
The Jarque–Bera test is widely adopted due to its computational simplicity and its direct connection to fundamental distributional characteristics. It is particularly popular in econometrics and financial applications, where skewness and the thickness of distribution tails are of primary concern. However, the test may demonstrate limited sensitivity to localized deviations that do not strongly affect skewness or kurtosis. Furthermore, its reliance on asymptotic theory restricts its effectiveness for small sample sizes.
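A minimal sketch of the Jarque–Bera computation is given below, using the biased (population-style) sample skewness and kurtosis that `scipy.stats.skew` and `scipy.stats.kurtosis` return by default. The heavy-tailed Student-t sample is simulated purely for illustration and the helper name `jarque_bera_stat` is hypothetical.

```python
import numpy as np
from scipy import stats


def jarque_bera_stat(x):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4) from sample skewness S and kurtosis K."""
    x = np.asarray(x, dtype=float)
    n = x.size
    s = stats.skew(x)                    # sample skewness
    k = stats.kurtosis(x, fisher=False)  # Pearson kurtosis (equals 3 for a normal)
    return n / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)


rng = np.random.default_rng(3)
sample = rng.standard_t(df=3, size=1000)  # simulated heavy-tailed data

jb_manual = jarque_bera_stat(sample)
jb_scipy, p_value = stats.jarque_bera(sample)
print(f"JB (manual) = {jb_manual:.2f}, JB (scipy) = {jb_scipy:.2f}, p = {p_value:.3g}")
```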

2.3.2. D’Agostino–Pearson Omnibus Test

The D’Agostino–Pearson omnibus test is based on the assumption that observations are independent and drawn from a continuous probability distribution. The null hypothesis states that the data are normally distributed. In contrast to tests that rely exclusively on raw moments, this procedure applies transformations to measures of skewness and kurtosis in order to enhance performance with finite samples [22].
Let $Z_S$ and $Z_K$ represent the standardized transformations of sample skewness and kurtosis, respectively. The test statistic is defined as follows:
$$K^2 = Z_S^2 + Z_K^2.$$
This statistic aggregates evidence against normality that arises from both asymmetry and tail behavior [22]. Large values of the omnibus statistic indicate departures from normality that are driven by skewness, kurtosis, or a combination of these factors. Under the null hypothesis, the statistic asymptotically follows a Chi-square distribution with two degrees of freedom.
The D’Agostino–Pearson test is often preferred over simpler moment-based tests because it demonstrates superior statistical power, particularly for moderate sample sizes. It provides a comprehensive assessment of deviations from normality by accounting for both shape and tail effects. However, the test is specifically designed for assessing normality and is not suitable for evaluating Goodness-of-Fit with respect to other distributions. As with other moment-based methods, it may fail to detect deviations that do not significantly affect skewness or kurtosis.
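The decomposition $K^2 = Z_S^2 + Z_K^2$ can be checked numerically. In SciPy, `skewtest` and `kurtosistest` return the standardized skewness and kurtosis components, and `normaltest` combines them; the sketch below uses a simulated sample as an illustrative assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.standard_t(df=5, size=500)  # simulated data for illustration

z_s, _ = stats.skewtest(sample)      # standardized skewness component Z_S
z_k, _ = stats.kurtosistest(sample)  # standardized kurtosis component Z_K
k2_manual = z_s ** 2 + z_k ** 2

k2_scipy, p_value = stats.normaltest(sample)  # omnibus K^2 and its p-value
print(f"Z_S = {z_s:.3f}, Z_K = {z_k:.3f}, K^2 = {k2_scipy:.3f}, p = {p_value:.3g}")
```

Reporting the two components alongside the omnibus value indicates whether a rejection is driven mainly by asymmetry or by tail weight.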

2.4. Characteristic Function-Based Tests

Characteristic function-based tests evaluate distributional equality by comparing the empirical characteristic functions of two samples. Unlike moment-based approaches that examine specific moments, these tests capture information about the entire distribution through its characteristic function, providing sensitivity to differences in all moments simultaneously.

Epps–Singleton Test

The Epps–Singleton test evaluates distributional equality by measuring the distance between the empirical characteristic functions of two samples. The test does not require distributional assumptions and is applicable to both univariate and multivariate data. Its null hypothesis states that two independent samples are drawn from the same continuous distribution, making it particularly suitable for comparing financial return distributions where normality assumptions rarely hold [23]. For samples $X_1, \dots, X_{n_X}$ and $Y_1, \dots, Y_{n_Y}$ with empirical characteristic functions $\hat{\phi}_X(t)$ and $\hat{\phi}_Y(t)$, and with pooled sample size $n = n_X + n_Y$, the test statistic is based on the following transformation:
$$g_X(t) = \left( \cos(tX), \sin(tX) \right)^{T}, \qquad g_Y(t) = \left( \cos(tY), \sin(tY) \right)^{T}$$
The weighted quadratic form test statistic is computed as:
$$W = n \cdot (\bar{g}_X - \bar{g}_Y)^{T} \, \hat{\Sigma}^{-1} \, (\bar{g}_X - \bar{g}_Y)$$
where $\bar{g}_X$ and $\bar{g}_Y$ are the sample means of $g_X(t)$ and $g_Y(t)$ evaluated at selected points $t_1, \dots, t_k$, and $\hat{\Sigma}$ is the estimated covariance matrix:
$$\hat{\Sigma} = \frac{n}{n_X} \hat{\Sigma}_X + \frac{n}{n_Y} \hat{\Sigma}_Y$$
with $\hat{\Sigma}_X$, $\hat{\Sigma}_Y$ being the sample covariance matrices. Under the null hypothesis of distributional equality, $W$ asymptotically follows a $\chi^2$ distribution with $2k$ degrees of freedom, that is, twice the number of evaluation points.
The test is implemented with default evaluation points $t = (0.4, 0.8)$, scaled by the semi-interquartile range as recommended in the original literature. For small samples (fewer than 25 observations), a correction factor is applied:
$$W_{\text{corrected}} = W \cdot \frac{1}{1 + n^{-0.45} + 10.1 \left( n_X^{-1.7} + n_Y^{-1.7} \right)}$$
This implementation provides a robust non-parametric test for distributional equality that is particularly sensitive to differences in higher moments and tail behavior, making it valuable for financial applications where return distributions often deviate from normality.
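SciPy provides an implementation of this test as `scipy.stats.epps_singleton_2samp`, whose default evaluation points match the $t = (0.4, 0.8)$ quoted above. Whether the study used this particular routine is an assumption; the sketch below only illustrates the call on two simulated samples with similar centers but different tail weight.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.standard_normal(1000)        # simulated light-tailed sample
y = rng.standard_t(df=3, size=1000)  # simulated heavy-tailed sample

# t=(0.4, 0.8) are the evaluation points; with k = 2 points the reference
# chi-square distribution has 2k = 4 degrees of freedom.
w_stat, p_value = stats.epps_singleton_2samp(x, y, t=(0.4, 0.8))
print(f"W = {w_stat:.3f}, p = {p_value:.3g}")
```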

2.5. Probability Plot Correlation Coefficient (PPCC) Test

The Probability Plot Correlation Coefficient (PPCC) test is based on the assumption that the observations constitute an independent sample drawn from a continuous distribution. Under the null hypothesis, the ordered observations are expected to align approximately linearly with the corresponding theoretical quantiles. The test requires the theoretical distribution to be fully specified. However, extensions have been developed to accommodate parameter estimation [24].
Let $X_{(1)} \le \dots \le X_{(n)}$ denote the ordered sample observations, and let $Q_1, \dots, Q_n$ denote the corresponding theoretical quantiles under the null hypothesis. The PPCC statistic is defined as the Pearson correlation coefficient between $\{X_{(i)}\}$ and $\{Q_i\}$ [24]:
$$r = \frac{\sum_{i=1}^{n} (X_{(i)} - \bar{X})(Q_i - \bar{Q})}{\sqrt{\sum_{i=1}^{n} (X_{(i)} - \bar{X})^2 \, \sum_{i=1}^{n} (Q_i - \bar{Q})^2}},$$
which measures the strength of linear association between empirical and theoretical quantiles.
Correlation coefficient values approaching one indicate a strong agreement with the hypothesized distribution. Lower values indicate deviations from linearity in the probability plot and, consequently, departures from the null hypothesis. Critical values are typically determined through simulation or by using tabulated approximations. The PPCC test serves as a quantitative complement to graphical probability plots. It is particularly valued for its intuitive interpretation and strong power against a wide range of alternatives, especially for normality testing. The method is commonly used in quality control and exploratory data analysis. However, the test is sensitive primarily to deviations that affect the overall linear structure of the probability plot, and it may exhibit reduced sensitivity to localized departures or subtle tail effects. Additionally, the null distribution of the statistic depends on the assumed model and sample size, necessitating simulation-based critical values.
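As an illustration, `scipy.stats.probplot` returns the theoretical quantiles, the ordered sample, and the correlation coefficient of the fitted probability-plot line, which corresponds to the PPCC statistic $r$ defined above. The simulated normal sample is an assumption of the example, and SciPy's choice of plotting positions for the theoretical quantiles may differ from the one used in the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
sample = rng.standard_normal(200)  # simulated data for illustration

# osm: theoretical quantiles Q_i; osr: ordered observations X_(i).
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")

# The same coefficient computed directly as a Pearson correlation.
r_manual = np.corrcoef(osm, osr)[0, 1]
print(f"PPCC r = {r:.4f} (direct Pearson correlation: {r_manual:.4f})")
```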

2.6. Shapiro–Wilk Test

The Shapiro–Wilk test assumes that observations are independent and drawn from a continuous distribution. Under the null hypothesis, the data are normally distributed. The test is based on an optimal linear combination of ordered observations and is particularly effective for small and moderate sample sizes [25].
Let $X_{(1)}, \dots, X_{(n)}$ denote the ordered sample observations. The Shapiro–Wilk statistic is defined as [25]
$$W = \frac{\left( \sum_{i=1}^{n} a_i X_{(i)} \right)^2}{\sum_{i=1}^{n} (X_{(i)} - \bar{X})^2},$$
where the coefficients $\{a_i\}$ depend on the expected values and covariance structure of order statistics from a standard normal distribution. Values of $W$ close to one indicate consistency with normality, whereas smaller values suggest departures from the null hypothesis. Statistical significance is typically assessed using p-values or critical values derived from the sample size.
The Shapiro–Wilk test is widely regarded as one of the most powerful tests for normality, especially for small samples. However, it is limited to normality assessment and can become overly sensitive in very large samples, which are common in financial time series. In such cases, even minor and practically irrelevant deviations may be flagged as significant.
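Because the coefficients $a_i$ are tabulated or approximated numerically, the test is usually run through a library routine. The sketch below applies `scipy.stats.shapiro` to one simulated normal sample and one simulated heavy-tailed sample; both samples and their sizes are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_sample = rng.standard_normal(500)        # simulated, consistent with H0
heavy_sample = rng.standard_t(df=3, size=500)   # simulated, heavy-tailed

w_normal, p_normal = stats.shapiro(normal_sample)
w_heavy, p_heavy = stats.shapiro(heavy_sample)

print(f"normal sample: W = {w_normal:.4f}, p = {p_normal:.3g}")
print(f"heavy-tailed:  W = {w_heavy:.4f}, p = {p_heavy:.3g}")
```

Comparing $W$ itself, rather than only the p-value, gives a sense of how far a sample is from normality when the sample is large enough that almost any deviation is flagged as significant.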

2.7. Information-Theoretic and Distance-Based Tests

Information-theoretic and distance-based Goodness-of-Fit tests evaluate distributional adequacy by quantifying the discrepancy between empirical and theoretical probability measures. Instead of limiting analysis to moments, spacings, or cumulative distributions, these methods rely on divergence measures or metric distances defined on probability spaces. These approaches offer a flexible framework that can detect a broad spectrum of distributional departures [2].

2.7.1. Kullback–Leibler Divergence Test

The Kullback–Leibler (KL) divergence test is based on the assumption that the underlying distribution possesses a density function and that it is possible to obtain an empirical estimate of this function. It is also assumed that the observations are independent. The null hypothesis specifies a completely defined theoretical distribution against which the divergence is measured [26].
Let $f(x)$ denote the theoretical density under the null hypothesis and $\hat{f}(x)$ an empirical density estimate. The Kullback–Leibler divergence is defined as [26]
$$D_{\mathrm{KL}}(\hat{f} \,\|\, f) = \int \hat{f}(x) \log \frac{\hat{f}(x)}{f(x)} \, dx.$$
This quantity measures the information loss incurred when $f$ is used to approximate the empirical distribution.
Larger divergence values indicate greater dissimilarity between the empirical and theoretical distributions. Values close to zero indicate strong agreement with the null hypothesis. Statistical significance is typically assessed using asymptotic results or resampling-based procedures.
KL-based tests are advantageous due to their strong theoretical foundation in information theory. These tests are particularly useful in model selection, likelihood-based inference, and contexts where interpretability in terms of information loss is desirable. The divergence is asymmetric and can be sensitive to regions where the theoretical density is close to zero. The performance of the test depends on the quality of the density estimation procedure and can degrade in small samples.
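One possible way to evaluate the divergence numerically is sketched below. It assumes a Gaussian kernel density estimate for $\hat{f}$ and trapezoidal integration on a finite grid. The helper name `kl_divergence_kde`, the grid size, and the padding are hypothetical choices, not the procedure of any particular reference.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid


def kl_divergence_kde(sample, ref_pdf, grid_size=2001, pad=3.0):
    """Approximate D_KL(f_hat || f) with a Gaussian KDE for f_hat."""
    sample = np.asarray(sample, dtype=float)
    kde = stats.gaussian_kde(sample)
    s = sample.std(ddof=1)
    grid = np.linspace(sample.min() - pad * s, sample.max() + pad * s, grid_size)
    f_hat = kde(grid)
    f_ref = ref_pdf(grid)
    # Skip grid points where either density underflows, to avoid log(0).
    mask = (f_hat > 1e-300) & (f_ref > 1e-300)
    integrand = np.zeros_like(grid)
    integrand[mask] = f_hat[mask] * np.log(f_hat[mask] / f_ref[mask])
    return trapezoid(integrand, grid)


rng = np.random.default_rng(0)
sample = rng.normal(size=2000)          # synthetic data, illustration only
null_pdf = stats.norm(0.0, 1.0).pdf     # fully specified null density

kl_same = kl_divergence_kde(sample, null_pdf)
kl_shifted = kl_divergence_kde(sample + 1.0, null_pdf)
print(f"matching sample: {kl_same:.4f}, shifted sample: {kl_shifted:.4f}")
```

The estimate depends on the bandwidth of the density estimator, which is one concrete form of the sensitivity to density estimation quality noted above.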

2.7.2. Energy Distance Test

The energy distance test is based on the assumption of independent observations and can be applied to both univariate and multivariate datasets. The null hypothesis specifies that the empirical distribution coincides with a specified reference distribution. Explicit density estimation is not required [27].
Let $X$ denote a random variable with the empirical distribution and $Y$ a random variable drawn from the theoretical distribution. The energy distance statistic is defined as [27]
$$\mathcal{E}(X, Y) = 2\,\mathbb{E}\|X - Y\| - \mathbb{E}\|X - X'\| - \mathbb{E}\|Y - Y'\|,$$
where $X'$ and $Y'$ are independent copies of $X$ and $Y$, respectively.
The energy distance is zero if and only if the two distributions are identical. Larger values indicate increasing separation between the empirical and theoretical distributions. The energy distance test demonstrates high power against a broad class of alternative hypotheses and extends naturally to multivariate settings. Its metric-based formulation eliminates the need for binning or explicit density estimation, enhancing robustness and flexibility in practical applications. The test can be computationally intensive for large sample sizes because it requires pairwise distance calculations. In addition, the interpretation of effect size may be less intuitive compared to moment- or EDF-based tests.
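For univariate samples, the statistic can be estimated by replacing each expectation with an average over all pairs. The sketch below is a plain NumPy implementation under that assumption, and the function name is illustrative. Note that `scipy.stats.energy_distance` reports the square root of this quantity.

```python
import numpy as np


def energy_distance_sq(x, y):
    """Sample version of 2 E|X - Y| - E|X - X'| - E|Y - Y'| for 1-D data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute differences: O(N^2) time and memory.
    d_xy = np.abs(x[:, None] - y[None, :]).mean()
    d_xx = np.abs(x[:, None] - x[None, :]).mean()
    d_yy = np.abs(y[:, None] - y[None, :]).mean()
    return 2.0 * d_xy - d_xx - d_yy


rng = np.random.default_rng(0)
a = rng.normal(size=500)               # synthetic data, illustration only
b = rng.normal(loc=0.5, size=500)

print(energy_distance_sq(a, a))        # identical samples give zero
print(energy_distance_sq(a, b))        # shifted sample gives a positive value
```

The pairwise matrices make the quadratic cost mentioned above explicit.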

3. Empirical Framework and Results: Application of GoF Tests in Financial Diagnostics

This section serves as the empirical bridge between the theoretical mathematical framework of Goodness-of-Fit testing and its practical implications in financial market analysis. Moving beyond univariate distribution tests, we implement a comprehensive multi-stage diagnostic framework that utilizes GoF metrics, Extreme Value Theory (EVT), and hierarchical clustering methods to decode the complex, non-Gaussian return structures of prominent equities. The framework evaluates distributional symmetry and tail asymmetry from three complementary perspectives (benchmark alignment with market indices, sectoral coherence with industry ETFs, and risk-based clustering) to reveal latent stochastic symmetries and risk similarities.

3.1. Methodological Framework for Detecting Stochastic Symmetry

The dynamic and often turbulent nature of financial markets frequently causes asset returns to deviate significantly from the Gaussian (normal) distribution, typically exhibiting heavy tails (leptokurtosis) and structural asymmetry (skewness). While traditional portfolio theories primarily focus on the second moments of distributions—such as variance and standard deviation—modern risk management necessitates a profound understanding of the entire distributional profile of assets, particularly concerning extreme tail asymmetry. To prevent spurious diagnostic results caused by volatility clustering, it is methodologically imperative to isolate IID innovations using ARMA-GARCH filtering before applying GoF tests. In this section, the efficacy of the developed Goodness-of-Fit mathematical methodology is evaluated through a comprehensive multi-stage empirical application to assess the stochastic symmetry within a selected portfolio of equities.

Significance of GoF Tests in Assessing Stochastic Symmetry

In financial analysis, GoF tests serve not merely as tools for statistical validation but as sophisticated diagnostic mechanisms capable of measuring stochastic symmetry and distributional similarity between assets. Unlike classical correlation analysis, which only captures linear dependencies, tests such as Kolmogorov–Smirnov (KS), Anderson–Darling (AD), and Epps–Singleton (ES)—complemented by Energy Distance—evaluate tail asymmetry, how assets respond to market shocks, and their overall non-Gaussian return “geometry”.
The selection of the finance-benchmark-sector framework, alongside the newly introduced tail-risk profiling and portfolio simulations, for this application is motivated by five strategic considerations:
  • Benchmark Alignment: Understanding the extent to which an investment aligns with broad market indices (S&P 500, NASDAQ, and DOW) in terms of distributional symmetry—not just price trajectory—is critical for managing systematic risk.
  • Sectoral Homogeneity: Analyzing the stochastic alignment of stocks relative to their respective sectors allows for the decoupling of sectoral pressures from idiosyncratic, asymmetric asset-specific risks.
  • Tail-Risk Profiling: Synthesizing structural divergence metrics (Energy Distance) with extreme tail-risk indicators via Extreme Value Theory (EVT) and robust L-moments isolates 1% and 5% tail asymmetries, quantifying the depth of downside risk independent of volatility scaling.
  • Hierarchical Risk Clustering: Using hierarchical clustering on the combined metrics (structural divergence and tail-risk indicators) enables the rigorous categorization of highly correlated assets into distinct asymmetric risk profiles, revealing risk homology groups that transcend conventional sector classifications.
  • Economic Significance and Portfolio Utility: Dynamically testing the economic value of these diagnostic profiles through bootstrapped backtesting demonstrates that incorporating GoF and tail-risk metrics yields superior downside protection (optimized Expected Shortfall management) compared to random allocations.

3.2. Five-Stage Diagnostic Framework

The analysis follows a progressively granular structure, moving from broad stochastic market alignment to specific asymmetric risk characterization, and concluding with portfolio-level economic validation:
1. Stage 1: Benchmark Distributional Symmetry. We test how closely individual stocks (AAPL, TSLA, JPM, XOM, JNJ, and WMT) follow the return distributions of major market indices (S&P 500, NASDAQ, and DOW) using ARMA-GARCH filtering. This stage answers: “To what extent does this stock exhibit stochastic symmetry with the overall market?”
2. Stage 2: Sectoral Stochastic Alignment. We examine whether stocks exhibit return distributions structurally symmetric to their respective sector ETFs using IID-filtered residuals and Energy Distance metrics. This stage isolates sector-specific risk from idiosyncratic tail asymmetry, addressing: “Is this stock’s distributional DNA typical of its sector?”
3. Stage 3: Extreme Tail Asymmetry. Moving beyond traditional metrics, we isolate 1% and 5% tail behaviors using Extreme Value Theory (EVT) and robust L-moments. This stage quantifies the depth of tail asymmetry and the intensity of jump risks that persist after volatility filtering, answering: “Which assets exhibit the most extreme left-tail behavior and how does it differ from their sector or the broad market?”
4. Stage 4: Hierarchical Risk Clustering. We then synthesize the tail-risk metrics (L-kurtosis) with structural divergence metrics (Energy Distance) via hierarchical clustering. This creates distinct “risk homology groups,” answering: “Which stocks share similar asymmetric risk and downside genetics, regardless of their conventional sector classifications?”
5. Stage 5: Economic Significance via Bootstrapped Backtesting. To demonstrate the financial utility of our diagnostic framework, we run a 1000-iteration bootstrapped portfolio simulation. By penalizing weights based on GoF divergence and tail risk, we compare our methodology against random allocations. This final stage answers: “Does utilizing non-parametric distributional diagnostics yield optimized Expected Shortfall management and better risk-adjusted returns?”
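The two-sample comparisons underlying Stages 1 and 2 can be sketched with SciPy's built-in tests. In the example below, the helper `two_sample_pvalues` and the synthetic Student-t series are assumptions standing in for filtered stock and benchmark innovations. They are not the study's data or code.

```python
import numpy as np
from scipy import stats


def two_sample_pvalues(a, b):
    """p-values of three two-sample tests of H0: both samples share one distribution."""
    return {
        "KS": stats.ks_2samp(a, b).pvalue,
        "CvM": stats.cramervonmises_2samp(a, b).pvalue,
        "ES": stats.epps_singleton_2samp(a, b).pvalue,
    }


rng = np.random.default_rng(42)
# Synthetic stand-ins for standardized innovations (about 1250 daily observations).
stock = rng.standard_t(df=5, size=1250)
benchmark = rng.standard_t(df=5, size=1250)

aligned = two_sample_pvalues(stock, benchmark)
shifted = two_sample_pvalues(stock + 1.0, benchmark)

for name in aligned:
    print(f"{name}: same law p = {aligned[name]:.3f}, shifted p = {shifted[name]:.3g}")
```

Large p-values are read as structural alignment and small ones as divergence, which is how the results in the later stages are interpreted.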

3.3. Data Description and Test Selection Rationale

3.3.1. Data Sample and Characteristics

Our empirical analysis employs daily closing price data for six bellwether stocks (AAPL, TSLA, JPM, XOM, JNJ, and WMT) spanning a five-year period from 1 February 2021 to 30 January 2026. The dataset was systematically retrieved via Yahoo Finance’s publicly available API (yfinance), capturing only active trading days. To ensure data continuity and methodological rigor, any missing values or non-trading days were handled using a forward-fill mechanism before calculating logarithmic returns.
These stocks were selected to represent diverse major Global Industry Classification Standard (GICS) sectors (Technology, Consumer Discretionary, Financials, Energy, Healthcare, and Consumer Staples) and varying market capitalizations, providing a broad cross-section of equity market behavior. From a practical standpoint, these highly liquid, mega-cap equities are core holdings of many passive index funds and active sector rotation strategies. Because they are heavily weighted in major indices, portfolio managers often implicitly assume that their return distributions align symmetrically with broader market movements. Testing this assumption is critical: if a heavily weighted stock exhibits latent tail asymmetry that diverges from its sector, traditional risk models (e.g., Value at Risk) can severely underestimate portfolio downside exposure.
Benchmark indices are proxied by ETFs (S&P 500 via SPY, NASDAQ via QQQ, Dow Jones via DIA), and the corresponding sector ETFs (XLK, XLY, XLF, XLE, XLV, XLP) provide the comparative distributional frameworks for our multi-stage analysis.

3.3.2. Rationale for Selected Diagnostic Metrics and GoF Tests

From the extensive universe of GoF methodologies reviewed in Section 2, our analytical selection was strictly guided by diagnostic suitability for financial return distributions. We prioritized non-parametric tests and advanced metrics that exhibit: (1) hyper-sensitivity to tail behavior, which is essential for catastrophic risk assessment; (2) robustness against spurious rejections caused by volatility clustering; (3) comparative, structural symmetry measurement capabilities rather than mere parametric fitting; and (4) complementary diagnostic perspectives, allowing methodological triangulation through distinct mathematical properties.
To achieve this, we synthesized a multidimensional diagnostic toolkit. The selected tests and metrics provide complementary, granular insights:
  • ARMA-GARCH IID Filtering: Serves as the foundational pre-processing step to strip away serial correlation and volatility clustering, ensuring that all subsequent GoF diagnostics are applied to IID structural innovations.
  • Kolmogorov–Smirnov (KS): Comprehensive CDF comparison, sensitive to distribution center.
  • Permutation-based Anderson–Darling (AD): Tail-weighted, critical for extreme risk assessment.
  • Cramér–von Mises (CvM): Integrated squared differences providing a more global measure of fit than KS across the entire distribution.
  • Epps–Singleton (ES): Characteristic function based, complements moment-based approaches.
  • Energy Distance: Metric-space approach incorporating all distributional moments.
  • Extreme Value Theory (EVT) and L-Moments: While GoF tests evaluate overall symmetry, EVT explicitly isolates the 1% and 5% critical left-tail asymptotics. This is complemented by L-Kurtosis (based on linear combinations of order statistics), which provides a robust measure of tail weight that is far less sensitive to outliers than conventional moment-based kurtosis.
  • Jarque–Bera/Shapiro–Wilk: Diagnostic checks for normality assumptions.
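To make the L-kurtosis entry concrete, the sketch below computes the first four sample L-moments from unbiased probability weighted moments and returns the ratio $\tau_4 = \lambda_4 / \lambda_2$. The function names are illustrative assumptions, and the test data are synthetic.

```python
import numpy as np


def sample_l_moments(x):
    """First four sample L-moments via unbiased probability weighted moments."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((i - 1) * (i - 2) * (i - 3) / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2, l3, l4


def l_kurtosis(x):
    """L-kurtosis tau_4 = lambda_4 / lambda_2."""
    _, l2, _, l4 = sample_l_moments(x)
    return l4 / l2


rng = np.random.default_rng(0)
tau4_uniform = l_kurtosis(rng.uniform(size=20000))        # theoretical value 0
tau4_normal = l_kurtosis(rng.normal(size=20000))          # theoretical value about 0.1226
tau4_t3 = l_kurtosis(rng.standard_t(df=3, size=20000))    # heavier tails, larger tau_4
print(tau4_uniform, tau4_normal, tau4_t3)
```

Because each L-moment is linear in the order statistics, a single extreme observation moves $\tau_4$ far less than it moves the conventional fourth-moment kurtosis.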

3.3.3. Computational Implementation and Numerical Stability

In alignment with the principles of open science and to ensure the reproducibility of our findings, the complete computational framework developed for this study has been made publicly available at the following GitHub repository: https://github.com/asevin85/GoF-Framework-Symmetry (accessed on 22 April 2026). To execute the multi-stage analytical framework, all mathematical estimations and statistical tests were implemented programmatically using Python 3.10. The core computational stack relied on the SciPy library for classical EDF-based GoF tests (KS, CvM) as well as the Epps–Singleton characteristic function and Energy Distance metrics; the arch package for ARMA-GARCH maximum likelihood estimations; scikit-learn for hierarchical clustering; and custom vectorized implementations for the permutation-based Anderson–Darling test and L-moment calculations.
Regarding numerical complexity, the proposed methodology involves significant computational overhead, primarily due to three components: (1) the Energy Distance metric, which requires pairwise distance calculations with a time complexity of $O(N^2)$; (2) the permutation-based Anderson–Darling test, requiring repetitive re-sampling ($p = 1000$ iterations per asset pair); and (3) the bootstrapped portfolio backtesting. However, given the sample size of approximately 1250 daily observations per asset (5 years), the computational burden remains highly manageable on modern multi-core processors when leveraging NumPy’s C-based vectorized array operations.
Crucially, concerning potential convergence problems, the empirical GoF tests themselves (KS, CvM, AD, ED, and ES) are either exact, metric based, or permutation driven, meaning they do not rely on iterative optimization algorithms. Consequently, they are not subject to non-convergence issues. Where iterative estimation was required, specifically in the ARMA-GARCH pre-filtering stage, we employed the default BFGS optimizer implemented in the arch package with strict tolerance thresholds to ensure reliable convergence. Furthermore, when estimating the 1% and 5% tail asymptotics within Extreme Value Theory (EVT), classical maximum likelihood estimation (MLE) is often susceptible to convergence failures in small tail samples. To mitigate this numerical instability, we utilized Probability Weighted Moments (PWMs), the mathematical foundation of L-moments, which provide closed-form, non-iterative estimators based on linear combinations of order statistics. This choice avoids optimization-related instability in EVT estimation and supports robust tail-risk quantification.
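A closed-form PWM fit of this kind can be sketched as follows. The example assumes a peaks-over-threshold setup with a generalized Pareto distribution (GPD) and uses the classical Hosking–Wallis PWM estimators. The exact estimator used in the study is not specified here, so the helper names and the threshold rule are assumptions.

```python
import numpy as np


def left_tail_excesses(innovations, tail_prob=0.05):
    """Losses beyond the (1 - tail_prob) loss quantile, measured from that threshold."""
    losses = -np.asarray(innovations, dtype=float)
    u = np.quantile(losses, 1.0 - tail_prob)
    return losses[losses > u] - u


def gpd_pwm_fit(excesses):
    """Closed-form PWM estimates (xi, sigma) of a GPD; reliable only for xi < 0.5."""
    y = np.sort(np.asarray(excesses, dtype=float))
    n = y.size
    i = np.arange(1, n + 1)
    a0 = y.mean()                                # estimates E[Y]
    a1 = np.sum((n - i) / (n - 1) * y) / n       # estimates E[Y (1 - F(Y))]
    k = a0 / (a0 - 2.0 * a1) - 2.0               # Hosking's shape, k = -xi
    sigma = 2.0 * a0 * a1 / (a0 - 2.0 * a1)
    return -k, sigma


# Synthetic check: draw GPD excesses by inverse transform with known parameters.
rng = np.random.default_rng(0)
xi_true, sigma_true = 0.2, 1.0
u01 = rng.uniform(size=5000)
gpd_sample = sigma_true * ((1.0 - u01) ** (-xi_true) - 1.0) / xi_true

xi_hat, sigma_hat = gpd_pwm_fit(gpd_sample)
print(f"xi_hat = {xi_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```

No optimizer is involved: both estimates are arithmetic functions of two weighted averages of the sorted excesses, which is why convergence failures cannot occur.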

3.3.4. Preliminary Diagnostic Checks and Data Preparation

Prior to distributional similarity testing, we conducted essential diagnostic verification and data preparation steps to ensure methodological rigor:
1. Return Calculation and IID Innovation Extraction: Following standard financial econometric practice [28,29], we employ continuously compounded (log) returns rather than simple percentage returns. This transformation, defined as $r_t = \ln(P_t / P_{t-1})$, offers three critical advantages for distributional analysis: (1) time-additivity, facilitating multi-period analysis without the positive bias inherent in compounding simple returns; (2) superior statistical properties, including greater symmetry and better approximation to normality under continuous compounding; and (3) preservation of interpretability, as log returns approximate percentage returns for small changes while avoiding compounding distortions for larger movements.
   Crucially, for non-parametric GoF testing, simple Z-score standardization is insufficient because it fails to account for volatility clustering (heteroskedasticity) prevalent in financial time series. Therefore, instead of raw standardized returns, we apply an ARMA-GARCH filtering process to extract standardized, conditionally homoscedastic innovations. This transformation isolates the underlying distributional shape from time-varying scale dynamics, allowing us to test whether stocks and benchmarks share structurally symmetric geometries without the confounding effects of volatility clustering.
2. Serial Independence (Ljung–Box Test): We verified the absence of significant serial correlation within our filtered innovations using the Ljung–Box test [30,31]. This diagnostic check is methodologically essential because GoF tests strictly assume independent observations. Financial returns, while typically assumed serially uncorrelated, can exhibit autocorrelation in certain regimes, which would compromise GoF test validity and lead to spurious rejections of the null hypothesis. Our pre-processing confirmed no evidence of significant autocorrelation at the 5% level across all analyzed assets. Notably, while initial raw data might exhibit marginal dependencies, our filtered structural innovations resolved these issues (e.g., yielding $p = 0.148$ for TSLA and $p = 0.550$ for XOM). This finding supports the serial-independence assumption underlying our testing framework.
3. Normality Assessment: We verified the expected non-normality of all return series (even after GARCH filtering) using Jarque–Bera and Shapiro–Wilk tests ($p < 0.001$ for all assets). This confirmation justifies our use of non-parametric GoF tests that do not assume normality and aligns with the well-established stylized fact of non-normal structural innovations in equity markets.
4. Permutation-Based Anderson–Darling Test: To address the limitations of asymptotic p-value approximations in financial tail analysis, we implement a permutation-based Anderson–Darling test [32,33,34]. Our implementation computes the observed Anderson–Darling statistic $A^2_{\mathrm{obs}}$ on the IID innovations, then performs 1000 random permutations of the pooled samples. For each permutation, we recalculate $A^2_{\mathrm{perm}}$ and estimate the p-value as $p = (B + 1)/1001$, where $B$ is the number of permutations with $A^2_{\mathrm{perm}} \geq A^2_{\mathrm{obs}}$. This permutation-based approach yields reliable p-value estimates while preserving the test’s tail sensitivity, which is critical for financial risk assessment.
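The permutation procedure in the last step can be sketched as follows. The two-sample statistic is written in its rank-based form for continuous data without ties. The function names and the seed are assumptions, and the study's own vectorized implementation may differ.

```python
import numpy as np


def ad_two_sample(x, y):
    """Two-sample Anderson-Darling statistic A^2 (assumes no ties)."""
    m, n = len(x), len(y)
    N = m + n
    pooled = np.concatenate([x, y])
    from_x = np.concatenate([np.ones(m), np.zeros(n)])
    order = np.argsort(pooled, kind="mergesort")
    # M[i-1] = number of x-observations among the i smallest pooled values.
    M = np.cumsum(from_x[order])[:-1]
    i = np.arange(1, N)
    return np.sum((M * N - m * i) ** 2 / (i * (N - i))) / (m * n)


def permutation_ad_test(x, y, n_perm=1000, seed=0):
    """Return (A2_obs, p) with p = (B + 1) / (n_perm + 1)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a_obs = ad_two_sample(x, y)
    pooled = np.concatenate([x, y])
    m = len(x)
    b = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if ad_two_sample(perm[:m], perm[m:]) >= a_obs:
            b += 1
    return a_obs, (b + 1) / (n_perm + 1)


rng = np.random.default_rng(123)
u = rng.standard_normal(300)            # synthetic data, illustration only
v = rng.standard_normal(300) + 1.0      # clearly shifted sample
a2, p_value = permutation_ad_test(u, v)
print(f"A2_obs = {a2:.2f}, p = {p_value:.4f}")
```

With 1000 permutations the smallest attainable p-value is $1/1001$, so the estimate never reports exactly zero.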
This systematic approach—from data selection through advanced IID filtering and diagnostic checks to robust test selection—ensures that our multi-stage GoF analysis addresses the specific challenges of financial return data while maintaining strict methodological rigor and interpretability.
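To illustrate the first preparation step, the sketch below computes log returns and applies a GARCH(1,1) variance recursion to obtain standardized innovations. It is deliberately simplified: the parameters `omega`, `alpha`, and `beta` are hypothetical fixed values rather than maximum likelihood estimates, and no ARMA mean equation is fitted, whereas the study estimates the full model with the arch package.

```python
import numpy as np


def log_returns(prices):
    """Continuously compounded returns r_t = ln(P_t / P_{t-1})."""
    return np.diff(np.log(np.asarray(prices, dtype=float)))


def garch11_standardize(returns, omega, alpha, beta, mu=0.0):
    """Standardized innovations z_t = (r_t - mu) / sigma_t under GARCH(1,1)."""
    eps = np.asarray(returns, dtype=float) - mu
    sigma2 = np.empty_like(eps)
    sigma2[0] = omega / (1.0 - alpha - beta)     # start at the unconditional variance
    for t in range(1, eps.size):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return eps / np.sqrt(sigma2)


# Simulate a GARCH(1,1) price path with hypothetical parameters, then filter it.
rng = np.random.default_rng(1)
omega, alpha, beta = 1e-6, 0.08, 0.90
n = 2000
z_true = rng.standard_normal(n)
r = np.empty(n)
s2 = omega / (1.0 - alpha - beta)
for t in range(n):
    r[t] = np.sqrt(s2) * z_true[t]
    s2 = omega + alpha * r[t] ** 2 + beta * s2
prices = 100.0 * np.exp(np.concatenate([[0.0], np.cumsum(r)]))

z = garch11_standardize(log_returns(prices), omega, alpha, beta)
print(f"variance of filtered innovations: {z.var():.3f}")
```

Because the filter here uses the same parameters that generated the data, it recovers the simulated innovations. In practice the parameters are estimated, and the Ljung–Box check verifies that the filtered series is serially uncorrelated.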
Methodological Contribution
This study contributes to the financial and statistical literature by introducing a comprehensive diagnostic framework that bridges the gap between theoretical distribution testing and applied portfolio management. Specifically, our contributions are five-fold: (1) implementing an ARMA-GARCH pre-processing layer to isolate IID structural innovations, thereby preventing spurious GoF rejections caused by volatility clustering; (2) applying a multi-stage, multi-metric GoF diagnostic framework (incorporating Energy Distance decomposition and permutation-based Anderson–Darling tests) to evaluate stochastic symmetry and decouple structural drift from tail asymmetry; (3) estimating extreme tail asymmetries at the 1% and 5% thresholds using Extreme Value Theory (EVT) and robust L-moments; (4) synthesizing these tail-risk metrics with structural divergence metrics via hierarchical clustering to categorize assets into distinct non-Gaussian risk profiles; and (5) demonstrating the economic significance of this framework through a bootstrapped portfolio backtest, which indicates that GoF-derived tail-risk penalties improve Expected Shortfall management and enhance downside protection under market stress.

3.4. Stage 1: Empirical Results of Benchmark Distributional Symmetry

This section presents the empirical findings from the first stage of our multi-tiered diagnostic framework: assessing distributional symmetry and stochastic alignment between individual stocks (AAPL, TSLA, JPM, XOM, JNJ, and WMT) and broad market benchmarks (S&P 500, NASDAQ, and Dow Jones). We integrate multiple complementary analytical approaches: (1) cumulative return trajectories for macroscopic temporal pattern identification, (2) non-parametric Goodness-of-Fit tests applied strictly to IID-filtered innovations for evaluating structural symmetry and tail asymmetry, and (3) traditional risk metrics (beta exposures) for linear dependency assessment. This multi-faceted methodology enables us to decode the stochastic DNA of asset returns, moving beyond simple price tracking to uncover fundamental asymmetric deviations between idiosyncratic stock behaviors and systematic market regimes.
By synthesizing these diverse statistical perspectives, we provide a comprehensive mapping of how idiosyncratic asset behaviors interact with systematic market regimes under varying tail-risk conditions. The following analysis presents robust quantitative evidence through integrated visual diagnostics and consolidated academic reporting of test statistics, offering both methodological transparency and practical financial insights into market symmetry.
Figure 1 displays the cumulative log returns of the six analyzed stocks alongside the three benchmark indices over the five-year period. It is important to note that, while our formal GoF diagnostics are conducted on ARMA-GARCH filtered IID innovations to avoid spurious rejections, this figure intentionally uses raw log returns to provide an unfiltered visual benchmark for comparison. This distinction highlights the difference between market-level performance patterns and the distributional similarity assessed in the formal statistical tests. We observe substantial performance variation across assets, with XOM achieving the highest cumulative return (1.342 log return) and TSLA the lowest (0.397). AAPL exhibits near-identical cumulative performance to the S&P 500 (0.682 vs. 0.681), suggesting potential distributional alignment that our formal statistical tests will evaluate.
Calculated beta values relative to the S&P 500 range from 0.21 (JNJ) to 2.01 (TSLA), indicating diverse systematic risk exposures. Notably, TSLA exhibits the highest systematic risk but the lowest cumulative return, while XOM demonstrates strong performance with moderate systematic risk—a paradox that challenges simple risk–return frameworks and directly hints at the presence of severe underlying left-tail asymmetry.
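For reference, the beta of an asset relative to a benchmark is the covariance of their returns divided by the variance of the benchmark return. The following minimal sketch uses synthetic return series, and the function name `market_beta` is illustrative.

```python
import numpy as np


def market_beta(asset_returns, market_returns):
    """Beta = Cov(r_asset, r_market) / Var(r_market)."""
    cov = np.cov(asset_returns, market_returns, ddof=1)
    return cov[0, 1] / cov[1, 1]


rng = np.random.default_rng(0)
market = rng.normal(scale=0.01, size=1250)                   # synthetic benchmark returns
asset = 2.0 * market + rng.normal(scale=0.02, size=1250)     # true beta of 2 plus noise

print(f"estimated beta: {market_beta(asset, market):.2f}")
```

Beta summarizes only linear co-movement, so two assets with the same beta can still differ in distributional shape and tail behavior.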
The cumulative trajectories provide visual intuition about relative performance but do not directly inform distributional similarity—a critical distinction addressed through our formal Goodness-of-Fit testing in the following section. These paths visually confirm that similar cumulative outcomes (such as AAPL tracking the S&P 500) can arise from fundamentally different return generation processes, heavily motivating our structurally focused, IID-based distributional diagnostic approach.

3.4.1. Goodness-of-Fit Test Results: Quantitative Evidence

Table 3 demonstrates that once volatility clustering is rigorously controlled via ARMA-GARCH filtering, individual stock innovations exhibit diverse levels of structural alignment with market benchmarks:
  • Structural Symmetry in High-Volatility Assets: A primary finding is that the TSLA IID innovations exhibit high structural symmetry with all three benchmarks (e.g., DOW: KS $p = 0.891$, AD $p = 0.867$). This confirms that while TSLA is characterized by high time-varying volatility, its underlying return geometry is fundamentally aligned with systematic market regimes once scale effects are removed.
  • Evidence of Distributional Divergence in Defensive Assets: WMT exhibits the most pronounced distributional departure, particularly from the NASDAQ (ES $p = 0.006$***). Despite its status as a defensive consumer staple, the WMT structural innovations fail to align with the distributional DNA of growth-heavy indices, highlighting the presence of idiosyncratic tail risk that survives the filtering process.
  • Complementary Diagnostic Sensitivity: The results underscore the necessity of the Epps–Singleton (ES) test, which detects structural mismatches in the characteristic function domain where CDF-based tests may not. For AAPL against the NASDAQ, the ES test rejects distributional similarity ($p = 0.036$**) despite high KS and AD p-values, indicating differences in higher-order moments or frequency-domain structures.
  • Financial Sector Integration: JPM exhibits consistently high KS and AD p-values across all benchmarks ($p > 0.56$), indicating robust structural alignment. However, the ES test yields a relatively lower p-value for the NASDAQ ($p = 0.171$) compared to other indices, further underscoring the granular sensitivity of characteristic-function-based diagnostics in detecting latent distributional nuances.
The analysis further reveals a significant “Decoupling Effect” between linear and structural measures:
  • Independence of Correlation and Symmetry: High linear correlation does not guarantee structural symmetry. AAPL shows strong correlation with NASDAQ (0.779) yet fails the ES symmetry test. Conversely, TSLA exhibits lower correlation with the DOW (0.431) but maintains near-perfect structural alignment. This decoupling demonstrates that distributional symmetry represents a distinct risk dimension from traditional dependency metrics.
  • Enhanced Tail Sensitivity: The consistently lower p-values produced by the Anderson–Darling (AD) test compared to the Kolmogorov–Smirnov (KS) test (e.g., WMT vs. S&P 500: KS $p = 0.231$ vs. AD $p = 0.102$) support our choice of AD for capturing latent asymmetries in distribution tails.
Methodological Validation
The Ljung–Box test results ($p \geq 0.148$ for all assets) indicate that the GARCH-based innovations show no significant serial correlation, supporting the statistical validity of the GoF diagnostics. Normality tests (all $p < 0.001$) confirm that even after filtering for heteroskedasticity, financial returns retain non-Gaussian characteristics, necessitating our non-parametric approach.
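The Ljung–Box statistic behind this check is $Q = n(n+2)\sum_{k=1}^{h} \hat{\rho}_k^2/(n-k)$, compared against a chi-squared distribution with $h$ degrees of freedom. The sketch below implements it directly. The lag count of 10 is a hypothetical choice, and when the input consists of residuals from a fitted ARMA model the degrees of freedom are usually reduced by the number of estimated ARMA parameters, which this sketch does not do.

```python
import numpy as np
from scipy import stats


def ljung_box(x, lags=10):
    """Ljung-Box Q statistic and chi-squared p-value for autocorrelation up to `lags`."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size
    denom = np.sum(x ** 2)
    k = np.arange(1, lags + 1)
    # Sample autocorrelations at lags 1..h.
    rho = np.array([np.sum(x[j:] * x[:-j]) / denom for j in k])
    q = n * (n + 2) * np.sum(rho ** 2 / (n - k))
    return q, stats.chi2.sf(q, df=lags)


rng = np.random.default_rng(0)
noise = rng.standard_normal(1000)       # synthetic white noise
ar1 = np.empty(1000)                    # synthetic AR(1) series with coefficient 0.5
ar1[0] = noise[0]
for t in range(1, 1000):
    ar1[t] = 0.5 * ar1[t - 1] + noise[t]

q_noise, p_noise = ljung_box(noise)
q_ar1, p_ar1 = ljung_box(ar1)
print(f"white noise: Q = {q_noise:.1f}, p = {p_noise:.3f}")
print(f"AR(1):       Q = {q_ar1:.1f}, p = {p_ar1:.3g}")
```

A large p-value means the null of no autocorrelation is not rejected. It does not by itself establish full independence, since dependence in higher moments is not examined.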

3.4.2. Visual Diagnostic: Goodness-of-Fit Test Heatmaps

Figure 2 presents a multi-panel heatmap visualization of the Goodness-of-Fit test results applied to IID innovations, employing a green-yellow color spectrum where darker green tones indicate higher p-values (stronger structural symmetry) and yellow tones indicate lower p-values (distributional divergence). This graduated color scheme allows for a nuanced interpretation beyond binary significance thresholds, visually highlighting how structural alignment varies across benchmarks and diagnostic metrics.
The heatmaps reveal gradient-based patterns that convey distributional relationships after heteroskedasticity has been removed:
  • Dark Green Clusters (Structural Alignment): JPM appears as a generally dark green anchor across most tests and benchmarks (KS, CvM, and AD), visually confirming its role as a reliable market distribution proxy, though the ES test against NASDAQ (0.171) indicates a slight nuance in higher-order moments. Notably, the TSLA-DOW and AAPL-DOW pairs exhibit intense dark green cells for the Kolmogorov–Smirnov and Cramér–von Mises tests ($p > 0.80$), with the Epps–Singleton test also showing moderate green ($p > 0.45$), illustrating that the underlying innovations of these volatile assets are structurally synchronized with industrial-heavy indices.
  • Yellow Regions (Structural Divergence): WMT emerges as the primary source of yellow patterns, particularly in the Epps–Singleton (ES) and Anderson–Darling (AD) panels against the NASDAQ ($p = 0.006$ and $p = 0.093$, respectively). These yellow “islands” visually identify where the structural DNA of a defensive asset fails to align with the systematic movement of growth-oriented indices.
  • Green-to-Yellow Gradients (Tail Sensitivity): In contrast to traditional price-based analysis, the Anderson–Darling panel displays a lighter green/yellowish tint compared to the Kolmogorov–Smirnov panel for several pairs (e.g., AAPL-S&P 500: KS 0.778 vs. AD 0.457). This visible gradient confirms AD’s superior sensitivity to tail behavior, identifying subtle structural asymmetries that the center-weighted KS test overlooks.
Each benchmark column exhibits characteristic color patterns:
  • DOW Column Dominance: The rightmost column (DOW comparisons) consistently shows the darkest greens across all tests, spanning TSLA (0.891), JPM (0.809), and AAPL (0.809). This visual pattern suggests the Dow Jones index provides the most robust structural fit for these stocks once volatility scales are normalized.
  • NASDAQ Column Heterogeneity: The NASDAQ column displays the greatest color variability, ranging from the dark green of JPM to the bright yellow of WMT ($p < 0.01$) and AAPL/TSLA ES tests ($p < 0.04$). This spectrum reflects the technology index’s selective structural alignment, which is highly sensitive to the specific moment structures of individual assets.
  • S&P 500 Standard: The S&P 500 column shows stable, intermediate green coloring for most pairs, capturing its role as a general market benchmark with reliable but not absolute distributional alignment.
Each test panel exhibits distinct color distribution properties:
  • Kolmogorov–Smirnov Panel: Shows a predominantly green landscape, reflecting the focus of KS on the empirical CDF center, where most filtered innovations exhibit high symmetry.
  • Anderson–Darling Panel: Displays more varied and lighter tones than the KS panel. The appearance of yellow-green transitions (e.g., WMT at 0.093 and AAPL at 0.346) visually validates our decision to use AD as a more rigorous detector of tail-driven divergence.
  • Cramér–von Mises Panel: Mirrors the KS panel’s stability but with more granular color steps, reflecting its integrated approach to measuring structural distance across the entire distribution.
  • Epps–Singleton Panel: Contains the most aggressive color contrasts, with multiple yellow cells for AAPL, TSLA, and WMT. This panel acts as a “diagnostic outlier detector”, drawing attention to relationships where moment structures (characteristic functions) differ even when CDF shapes appear similar.
Figure 3 presents the complete Pearson correlation matrix. The analysis reveals that while stock-to-stock correlations are generally weak to moderate (mostly between 0.080 and 0.497), stock-to-benchmark relationships show stronger linear dependencies. Benchmark indices exhibit exceptionally high mutual correlations (SPY-QQQ: 0.947, SPY-DIA: 0.919), indicating substantial common variation. Sector patterns emerge clearly: technology stocks correlate strongly with QQQ (AAPL: 0.779, TSLA: 0.637), financials with DIA (JPM: 0.715), while energy (XOM) and defensive (JNJ, WMT) stocks show weaker benchmark alignment.
Notably, XOM demonstrates persistent market isolation with correlations below 0.41 against all benchmarks. These linear patterns complement our distributional findings, highlighting the divergence between correlation and structural symmetry. For instance, while AAPL shows a strong correlation with NASDAQ (0.779), it exhibits structural divergence in the ES test (p = 0.036 **). Conversely, TSLA exhibits moderate correlation with DIA (0.432) but remarkably high structural similarity (KS p = 0.891, AD p = 0.867; see Table 3). This decoupling underscores that linear dependency and distributional shape represent distinct statistical dimensions.
Figure 4 illustrates the 252-day rolling annualized volatility, providing a temporal dimension to our distributional diagnostics. The visualization reveals pervasive heteroscedasticity across the sample, with TSLA exhibiting substantially higher volatility (approximately 3.5 × that of the S&P 500) throughout the analysis period. Defensive assets, particularly JNJ and WMT, demonstrate notably lower and more stable risk profiles, while benchmark indices display moderate volatility levels. DIA represents the most conservative market proxy with the lowest mean volatility (0.147).
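The rolling measure described above can be sketched in a few lines. This is a minimal illustration, assuming that "252-day rolling annualized volatility" means the sample standard deviation of daily returns over a 252-day window scaled by the square root of 252. The function name and the synthetic return series are hypothetical and are not the paper's data or code.

```python
import math
import random
import statistics

TRADING_DAYS = 252  # window length and annualization factor, as in the text


def rolling_annualized_vol(returns, window=TRADING_DAYS):
    """Return one annualized volatility per full window of daily returns."""
    vols = []
    for end in range(window, len(returns) + 1):
        daily_sd = statistics.stdev(returns[end - window:end])
        vols.append(daily_sd * math.sqrt(TRADING_DAYS))
    return vols


# Synthetic daily returns (hypothetical): a calm asset and one 3.5x as volatile.
rng = random.Random(0)
calm = [rng.gauss(0.0, 0.01) for _ in range(600)]
wild = [rng.gauss(0.0, 0.035) for _ in range(600)]

vol_calm = rolling_annualized_vol(calm)
vol_wild = rolling_annualized_vol(wild)
ratio = statistics.mean(vol_wild) / statistics.mean(vol_calm)
print(f"windows: {len(vol_calm)}, mean volatility ratio: {ratio:.2f}")
```

A ratio of this kind is what the "approximately 3.5×" comparison between TSLA and the S&P 500 summarizes.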
The rolling volatility analysis provides critical justification for our GARCH-based pre-filtering. We observe two distinct diagnostic phenomena. First, scale-invariant similarity: after extracting structural innovations, assets such as TSLA and DIA exhibit high distributional similarity despite their massive disparity in volatility magnitude. This confirms that our IID-based diagnostics successfully isolate the shape and structural characteristics of risk from its mere scale. Second, risk-magnitude convergence versus structural divergence: assets may share similar volatility magnitudes (e.g., AAPL and XOM) while exhibiting fundamental distributional decoupling in the tails. This distinction directly motivates our dual Energy Distance decomposition in Stage 2, where we explicitly separate magnitude-driven total distance (ED_Total) from shape-based structural divergence (ED_Shape).

3.5. Stage 2: Sectoral Distributional Symmetry Analysis

Building upon the benchmark distributional symmetry assessment, Stage 2 investigates whether individual stocks exhibit stochastic symmetry and return distributions that structurally align with their respective economic sectors. This analysis addresses a fundamental question in financial diagnostics: Do the structurally filtered innovations of stocks exhibit stronger stochastic symmetry with their sectoral peers, or do they share latent asymmetric tail behaviors with the broader market? While conventional sector classifications are widely used in portfolio construction and risk management, their statistical validity from a non-Gaussian distributional perspective—particularly regarding tail asymmetry and heteroskedasticity-adjusted structural shape—remains empirically underexplored.
The methodological framework employs a sector-ETF mapping approach where each stock is paired with its corresponding SPDR sector ETF: Technology (XLK for AAPL), Consumer Discretionary (XLY for TSLA), Financials (XLF for JPM), Energy (XLE for XOM), Healthcare (XLV for JNJ), and Consumer Staples (XLP for WMT). This mapping enables a direct comparison between individual stocks and diversified sector proxies, isolating sector-specific distributional characteristics from idiosyncratic stock behaviors. The analytical approach mirrors Stage 1, utilizing a parallel suite of non-parametric Goodness-of-Fit tests—Kolmogorov–Smirnov, Cramér–von Mises, and Epps–Singleton—applied strictly to conditionally homoscedastic (IID) innovations. Crucially, to operationalize the scale-versus-shape decoupling motivated in Stage 1, this suite is now supplemented by a novel Dual Energy Distance framework (decoupling Total Energy Distance from Shape-based Energy Distance) alongside Jarque–Bera normality diagnostics.
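To make the two-sample testing idea concrete, the sketch below computes a Kolmogorov–Smirnov statistic and attaches a permutation p-value to it. It is an illustration only: the study applies permutation to the Anderson–Darling test, whereas this example swaps in the simpler KS statistic, and the function names, the number of permutations, and the synthetic samples are all assumptions.

```python
import bisect
import random


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:  # the supremum is attained at an observed value
        fx = bisect.bisect_right(xs, v) / len(xs)
        fy = bisect.bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def permutation_pvalue(x, y, stat=ks_statistic, n_perm=200, seed=0):
    """Share of label shufflings whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = stat(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if stat(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero


# Synthetic stand-ins for filtered innovations (hypothetical data).
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y_same = [rng.gauss(0.0, 1.0) for _ in range(200)]
y_shift = [rng.gauss(1.0, 1.0) for _ in range(200)]

p_same = permutation_pvalue(x, y_same)
p_shift = permutation_pvalue(x, y_shift)
print(f"same distribution p = {p_same:.3f}, shifted distribution p = {p_shift:.3f}")
```

Any other two-sample statistic can be passed through the `stat` argument, which is why the permutation wrapper is kept separate from the statistic itself.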
Stage 2 tests two competing hypotheses: the Sector Dominance Hypothesis, which posits that asset innovations exhibit stronger stochastic symmetry with their sector ETFs than with market benchmarks, thereby validating traditional structural sector-based classification; and the Market Integration Hypothesis, which suggests stocks show greater structural alignment with market indices than sector ETFs, indicating that systemic factors dominate sector-specific characteristics. The analysis addresses several diagnostic questions, including which sectors demonstrate the strongest internal stochastic symmetry, whether certain stocks consistently exhibit asymmetric deviations from both market and sector benchmarks (indicating idiosyncratic, non-Gaussian return generation processes), and how these structural symmetries compare with traditional linear correlation-based sector analysis. This stage provides critical insights for portfolio diversification, sector rotation strategies, and tail-risk management by quantifying the mathematical foundations of sector-based investment approaches under strictly structurally normalized market conditions.

3.5.1. Empirical Results: Sectoral Distributional Symmetry

Table 4 presents the sectoral similarity analysis, utilizing ARMA-GARCH filtered IID innovations and a Dual Energy Distance framework to reveal striking sector-specific patterns with direct implications for portfolio construction.
Sectoral Coherence vs. Structural Divergence: The results demonstrate a clear hierarchy of sectoral coherence. At the level of the selected representative assets, the energy and financial sectors exhibit exceptional internal alignment. XOM is a clear case of “Sectoral Insularity”: it shows near-perfect distributional coherence with XLE (correlation: 0.932, KS p = 0.646, ES p = 0.883) and the smallest ED_Shape (0.0280) among all pairs. JPM similarly demonstrates moderate-to-strong structural alignment with XLF (KS p = 0.646, ES p = 0.194), supporting the financial sector’s relative homogeneity.
Conversely, WMT exhibits the most severe distributional divergence from its sector ETF (XLP), significantly rejecting the null hypothesis across all primary tests (CvM p = 0.029 **, ES p < 0.001 ***). Despite being a classic defensive staple, WMT harbors an idiosyncratic structural geometry that fundamentally decouples from its sectoral proxy. Furthermore, technology and consumer stocks (AAPL, TSLA) along with healthcare (JNJ) show high linear correlations but fail the Epps–Singleton test (p < 0.05), indicating substantial divergence in their higher-order moment structures despite synchronized directional movements. JNJ is a borderline case: while the KS test is only marginally significant (p = 0.076 *), the ES test rejects distributional equality (p = 0.011 **).
Decoupling Scale from Shape via Dual Energy Distance: Our Dual Energy Distance framework explicitly isolates differences caused by volatility magnitude (ED_Total) from fundamental structural geometry (ED_Shape). This decomposition provides critical diagnostic insights:
  • Volatility-Driven Divergence (TSLA): TSLA presents a larger ED_Total (0.0641) than ED_Shape (0.0509). This indicates that a substantial portion of TSLA’s divergence from XLY is an artifact of its large volatility scale rather than pure shape distortion.
  • Structure-Driven Divergence (WMT): In stark contrast, for WMT, ED_Shape (0.0826) is roughly five times larger than ED_Total (0.0164). This indicates that the WMT divergence is almost entirely rooted in fundamental structural shape and tail behavior, independent of volatility magnitude.
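The scale-versus-shape split can be illustrated with a short sketch. It assumes that ED_Total is the energy distance between the two raw samples and that ED_Shape is the same quantity computed after each sample is standardized to zero mean and unit variance. This section does not spell out the exact definition, so both that convention and the function names below are assumptions.

```python
import random
import statistics


def mean_abs_diff(a, b):
    """Average |a_i - b_j| over all pairs (a plain V-statistic)."""
    return sum(abs(u - v) for u in a for v in b) / (len(a) * len(b))


def energy_distance(x, y):
    """Squared energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return 2 * mean_abs_diff(x, y) - mean_abs_diff(x, x) - mean_abs_diff(y, y)


def zscore(a):
    """Standardize a sample to zero mean and unit sample variance."""
    mu, sd = statistics.mean(a), statistics.stdev(a)
    return [(v - mu) / sd for v in a]


def dual_energy_distance(x, y):
    """Return (ED_Total, ED_Shape); ED_Shape compares standardized samples."""
    return energy_distance(x, y), energy_distance(zscore(x), zscore(y))


# Hypothetical example: same Gaussian shape, three times the scale.
rng = random.Random(2)
base = [rng.gauss(0.0, 1.0) for _ in range(300)]
scaled = [3.0 * rng.gauss(0.0, 1.0) for _ in range(300)]

ed_total, ed_shape = dual_energy_distance(base, scaled)
print(f"ED_Total = {ed_total:.4f}, ED_Shape = {ed_shape:.4f}")
```

In this toy case the divergence is purely scale driven, so ED_Total is large while ED_Shape is close to zero. The opposite ordering, as reported for WMT, arises when standardization exposes differences in shape rather than removing them.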

3.5.2. Visual Diagnostics: Sectoral GoF Heatmap

The integration of these quantitative findings with visual heuristics (Figure 5) clearly communicates the validity of sector-based groupings. The dark green coherence of JPM and XOM suggests their respective sector ETFs are highly reliable distributional proxies for tail-risk hedging. Conversely, WMT displays a sharp gradient ending in a dark yellow warning zone (ES: <0.001), visually encoding its structural isolation. AAPL and TSLA occupy a middle ground of muted green/yellow tones, effectively communicating that while they co-move linearly with their sectors, their underlying return geometries deviate significantly. Ultimately, this integrated approach moves beyond binary hypothesis testing to deliver actionable intelligence: structural divergence can be heavily scale driven (TSLA) or purely shape driven (WMT), requiring fundamentally different risk management strategies.

3.5.3. Visualizing Structural Geometry: KDE Overlap Analysis

Figure 6 presents Kernel Density Estimation (KDE) plots comparing the standardized structural innovations (GARCH residuals) of each stock with its corresponding sector ETF. This visualization provides intuitive geometric evidence for the distributional (dis)similarity quantified by our Dual Energy Distance and GoF framework.
The plots reveal varying degrees of geometric alignment that perfectly mirror the statistical test results:
  • Structural Center Alignment (AAPL & JPM): AAPL exhibits a peak difference of exactly 0.000, while JPM shows a negligible deviation of 0.020. This near-perfect alignment at the mode of the distribution visually confirms that their structural innovations are centered identically with their respective sectors (XLK and XLF), validating their moderate-to-high KS p-values (0.272 and 0.646) despite potential differences in higher-order moments.
  • High Structural Coherence (XOM): XOM vs. XLE displays robust geometric alignment despite a peak shift of 0.120. The minimal shape divergence (ED_Shape = 0.0280, the lowest in the sample) corroborates the energy sector’s tight distributional consistency across the entire density profile.
  • Persistent Tail Asymmetry (JNJ): JNJ demonstrates a peak mismatch of 0.080 and, more importantly, an extended negative innovation range reaching −7.89 compared to XLV’s −6.52. This visual distortion in the lower tail directly explains its marginal statistical alignment (KS p = 0.076 *) and higher ED_Shape (0.0538).
  • Extreme Tail Divergence (WMT): WMT demonstrates the most pronounced visual divergence. While its peak difference is relatively small (0.040), its innovation distribution exhibits massive tail risk, with the range expanding to −9.70—the most extreme value in the dataset—compared to that of XLP, −6.72. This geometric evidence of tail-driven decoupling clearly visualizes why WMT suffers the strongest structural rejection (ED_Shape = 0.0826, KS p = 0.056 *).
  • Systemic Peak Shifts (TSLA): TSLA shows the largest peak difference (0.140) among all pairs, indicating a fundamental shift in the “most frequent” return outcomes relative to XLY. This, combined with its broader innovation range ([−4.99, 6.01]), confirms that TSLA operates under a different structural geometry than its consumer discretionary peers.
These KDE plots effectively demonstrate that even after removing heteroskedasticity via GARCH filtering, structural divergence remains significant. The findings suggest that sectoral decoupling is driven either by “Center Shifts” (as in TSLA) or “Tail Dilations” (as in WMT and JNJ), each requiring distinct risk management approaches beyond simple volatility scaling.

3.6. Stage 3: Extreme Risk Profiling via EVT and L-Moments

Following the structural symmetry analysis, Stage 3 employs Extreme Value Theory (EVT) and L-Moments to address the “tail paradoxes” identified in the preceding sections. By utilizing GARCH-filtered, IID innovations, we ensure that the following tail estimates are not contaminated by volatility clustering, providing a robust diagnostic of the assets’ latent risk structures across individual stocks, sector ETFs, and major market benchmarks (SPY, QQQ, and DIA).

3.6.1. L-Moment Decomposition and Structural Divergence

To elucidate why certain assets consistently reject distributional symmetry, we utilize L-Moments, which provide a robust decomposition of higher-order dynamics without the sensitivity to outliers inherent in traditional moments. As shown in Table 5, the L-Kurtosis (L_Kurt) identifies a clear structural hierarchy. WMT (0.2184) and JNJ (0.2017) exhibit the highest L-Kurtosis in the sample, significantly exceeding both market benchmarks (SPY: 0.1715) and their respective sectors (XLP: 0.1569; XLV: 0.1545). This mathematical evidence confirms that their structural divergence is driven by an inherent “peakedness” and tail-heaviness that persists even after strictly controlling for heteroskedasticity.
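For readers unfamiliar with L-moments, the following sketch computes the first four sample L-moments from probability-weighted moments and forms L-kurtosis as the ratio of the fourth to the second. It uses the standard unbiased estimators. The function names and the synthetic samples are illustrative, and the code is not taken from the authors' repository.

```python
import random


def sample_l_moments(data):
    """First four sample L-moments via probability-weighted moments b0..b3."""
    x = sorted(data)
    n = len(x)
    if n < 4:
        raise ValueError("need at least four observations")
    b = [0.0, 0.0, 0.0, 0.0]
    for i, v in enumerate(x):  # i is the 0-based rank of v
        w = 1.0
        for r in range(4):
            b[r] += w * v
            if r < 3:
                w *= (i - r) / (n - 1 - r)  # extends i(i-1).../((n-1)(n-2)...)
    b = [s / n for s in b]
    l1 = b[0]
    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]
    return l1, l2, l3, l4


def l_kurtosis(data):
    """L-kurtosis: ratio of the fourth L-moment to the second."""
    _, l2, _, l4 = sample_l_moments(data)
    return l4 / l2


rng = random.Random(3)
gauss_sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]
# Laplace variates as the difference of two exponentials (heavier tails).
laplace_sample = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(5000)]

tau4_gauss = l_kurtosis(gauss_sample)
tau4_laplace = l_kurtosis(laplace_sample)
print(f"Gaussian L-kurtosis: {tau4_gauss:.4f}, Laplace L-kurtosis: {tau4_laplace:.4f}")
```

The Gaussian reference value of L-kurtosis is about 0.1226, so the values reported in Table 5 (0.15 to 0.22) all indicate heavier-than-Gaussian tails.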

3.6.2. Tail Dynamics and the “Jump-Information” Evidence

The analysis of tail indices at the critical 1% threshold provides the necessary econometric foundation to address the “jump information” concern. These findings confirm that structural divergence remains significant even after removing volatility clustering:
  • The WMT Tail Crisis: WMT exhibits a Z-Min of −10.2465 and a 1% Left Tail Index of 2.0683, representing the most extreme tail dilation in the dataset. This indicates a “super-fat” left tail that is fundamentally more dangerous than its sector (XLP: 3.5619) or the broad market (SPY: 3.4955). This validates that GARCH-based normalization does not discard crucial information regarding the intensity of jumps; rather, it isolates it as an idiosyncratic structural trait.
  • TSLA Asymmetry Paradox: TSLA displays a Tail Asymmetry of 0.9949, confirming near-perfect stochastic symmetry at a macro level. However, at the 1% extreme, its Right Tail Index (2.7165) is significantly lower (fatter) than its Left Tail Index (4.3041). This suggests that the TSLA extreme shocks are predominantly positive, a unique characteristic that differentiates it from technology leaders like AAPL (Tail_Asym: 1.2435), which are dominated by left-tail risk.
  • Sectoral Insularity in Energy: XOM and XLE exhibit a unique right-tail dominance (Tail_Asym < 1). Notably, the XLE 1% Right Tail Index is exceptionally high (9.9389), indicating a rapid decay in extreme gains compared to the broad market benchmarks like QQQ (Tail_Asym: 0.9043). This structural alignment between XOM and XLE validates the “Sectoral Insularity” hypothesis under extreme conditions.
In conclusion, Stage 3 indicates that the discrepancies observed in the GoF tests are rooted in inherent tail dilation and higher-order structural decoupling. By identifying these features in IID innovations, the study provides a robust diagnostic model that supports the use of Energy Distance as an informative metric for Expected Shortfall analysis and non-Gaussian risk management.
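The tail indices discussed in this stage can be illustrated with the Hill estimator, which the Conclusions name as the estimator used. The sketch below is a minimal version under stated assumptions: the tail fraction is applied to the whole sample, the left tail is handled by flipping signs, and the function name and synthetic data are hypothetical. A lower index means a fatter tail.

```python
import math
import random


def hill_tail_index(data, tail_fraction=0.01, side="left"):
    """Hill estimate of the tail index from the top `tail_fraction` of one tail."""
    sign = -1.0 if side == "left" else 1.0
    # Flip signs for the left tail so that large losses become large positives.
    values = sorted((sign * v for v in data), reverse=True)
    k = max(2, int(len(values) * tail_fraction))
    threshold = values[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("tail threshold must be positive")
    mean_log_excess = sum(math.log(v / threshold) for v in values[:k]) / k
    return 1.0 / mean_log_excess


# Synthetic symmetric sample with Pareto tails of index 3 (hypothetical data).
rng = random.Random(4)
sample = [rng.choice((-1.0, 1.0)) * rng.paretovariate(3.0) for _ in range(20000)]

left_index = hill_tail_index(sample, 0.01, side="left")
right_index = hill_tail_index(sample, 0.01, side="right")
print(f"left tail index: {left_index:.2f}, right tail index: {right_index:.2f}")
```

With only the top 1% of observations in play, the estimate is noisy, which is one reason to read the 1% indices alongside the 5% indices.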

3.7. Stage 4: Risk-Based Hierarchical Clustering

To synthesize the structural and tail-risk dimensions uncovered in the previous stages, we perform an agglomerative hierarchical clustering of the six stocks using two complementary metrics: the shape component of Energy Distance (ED_Shape, capturing structural divergence) and L-kurtosis (L_Kurt, capturing robust tail heaviness). Figure 7 presents the resulting two-dimensional scatter plot, where the three distinct clusters are determined by Ward’s linkage and the Euclidean distance.
The clustering reveals a clear, economically intuitive risk taxonomy:
  • Low-Risk/Safe Haven (Green): XOM occupies this region alone. It exhibits the lowest ED_Shape (0.0280) and the lowest L-kurtosis (0.1544) in the sample, confirming that energy sector assets display both strong structural alignment with their sector proxy and very limited tail risk. This profile justifies the status of XOM as a distributionally stable asset within this framework.
  • Moderate Risk/Gray Zone (Orange): AAPL, JPM, JNJ, and TSLA form the central cluster. Their ED_Shape values range from 0.0385 (JPM) to 0.0538 (JNJ), and L-kurtosis from 0.1908 (TSLA) to 0.2017 (JNJ). These assets are moderately aligned with their sectors but exhibit non-negligible tail heaviness. Within this group, TSLA stands out with its near-perfect symmetry (Tail_Asym ≈ 1), while AAPL and JNJ show pronounced structural skewness.
  • High Risk/Toxic (Red): WMT is isolated as the single high-risk outlier. It displays the largest L-kurtosis (0.2184) and the highest ED_Shape (0.0826), driven by an extremely heavy left tail (Z-Min = −10.25) and a sharp structural decoupling from its consumer staples (XLP) benchmark. This outlier status provides an “economic proof of concept” for the extreme rejection of distributional similarity observed in Stage 2 (ES p < 0.001).
These groupings demonstrate that distributional GoF diagnostics, when combined with robust L-moment measures, move beyond binary hypothesis testing to construct actionable risk taxonomies. The results directly support the “Sectoral Insularity” of Energy (XOM) while exposing the hidden tail danger in a seemingly defensive consumer staple (WMT). Consequently, portfolio managers should treat sector-based diversification with caution, as structural outliers like WMT require dedicated tail-risk hedging regardless of their sector classification.
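The clustering step can be sketched without external libraries. The example below implements agglomerative clustering with Ward's criterion on two features and stops at three clusters. It assumes that the two features are standardized before clustering, which the text does not state, and the six points are hypothetical placeholders rather than the values in Table 5.

```python
import statistics


def zscore_columns(points):
    """Standardize each feature so neither dominates the Euclidean distance."""
    cols = list(zip(*points))
    mus = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(p, mus, sds)) for p in points]


def centroid(points, members):
    dims = len(points[0])
    return [sum(points[i][d] for i in members) / len(members) for d in range(dims)]


def ward_cost(points, a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(points, a), centroid(points, b)
    sq = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq


def ward_clusters(points, n_clusters):
    """Merge the cheapest pair repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        pairs = [(ward_cost(points, clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical (ED_Shape, L-kurtosis) pairs for six illustrative assets.
names = ["A", "B", "C", "D", "E", "F"]
features = [(0.020, 0.150), (0.040, 0.190), (0.045, 0.195),
            (0.050, 0.190), (0.050, 0.200), (0.090, 0.230)]

groups = ward_clusters(zscore_columns(features), n_clusters=3)
named_groups = sorted(sorted(names[i] for i in g) for g in groups)
print(named_groups)
```

With these placeholder inputs the procedure isolates one low-divergence asset and one high-divergence asset on either side of a central group, which is the same qualitative structure as the taxonomy described above.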

3.8. Stage 5: Economic Significance and Bootstrapped Portfolio Validation

To evaluate the economic relevance of the proposed distributional framework, we conduct a bootstrapped portfolio backtest. The simulation directly tests whether the structural metrics derived from our Goodness-of-Fit framework (Energy Distance, KS p-values, and L-kurtosis) can be translated into superior risk-adjusted portfolio performance compared to purely random allocation.

3.8.1. Methodological Setup

Using the six stocks in our sample, we generate 1000 random portfolio weights from a Dirichlet distribution (uniform prior), ensuring that each random portfolio represents a diversified yet arbitrary allocation. The average Sharpe ratio of these 1000 random portfolios serves as the null benchmark (mean Sharpe = 0.878). We then construct three hierarchical portfolio models with increasing diagnostic sophistication:
1. Model 1 (ED Only): Weights are inversely proportional to ED_Shape. This penalizes assets with high structural divergence from their sector ETF.
2. Model 2 (ED + KS): Weights are proportional to (1 / ED_Shape) × KS_p. This rewards stocks that are both structurally similar and statistically indistinguishable from their sector benchmark.
3. Model 3 (ED + KS + L-kurtosis): Weights are proportional to (1 / ED_Shape) × KS_p / L_Kurt. This introduces a penalty for assets with excessively heavy tails (high L-kurtosis), directly incorporating the tail-risk information from Stage 3.
All portfolios are rebalanced annually, and their out-of-sample performance is evaluated using annualized Sharpe ratio and Expected Shortfall (CVaR) at the 5% tail level.
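The three weighting rules and the random benchmark can be sketched as follows. The scores are normalized to sum to one, and the flat Dirichlet draw is generated by normalizing independent unit exponentials. The input values are hypothetical placeholders for three assets, and the function names are illustrative rather than those of the authors' implementation.

```python
import random


def normalize(scores):
    """Scale non-negative scores so that the weights sum to one."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def model_weights(ed_shape, ks_p, l_kurt):
    """Weights for Model 1 (ED), Model 2 (ED + KS), Model 3 (ED + KS + L-kurtosis)."""
    m1 = normalize({k: 1.0 / ed_shape[k] for k in ed_shape})
    m2 = normalize({k: ks_p[k] / ed_shape[k] for k in ed_shape})
    m3 = normalize({k: ks_p[k] / (ed_shape[k] * l_kurt[k]) for k in ed_shape})
    return m1, m2, m3


def random_dirichlet_weights(n_assets, rng):
    """One draw from a flat Dirichlet: normalized Exponential(1) variates."""
    draws = [rng.expovariate(1.0) for _ in range(n_assets)]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical diagnostics for three illustrative assets.
ed_shape = {"X": 0.03, "Y": 0.05, "Z": 0.08}
ks_p = {"X": 0.60, "Y": 0.30, "Z": 0.06}
l_kurt = {"X": 0.15, "Y": 0.19, "Z": 0.22}

m1, m2, m3 = model_weights(ed_shape, ks_p, l_kurt)
rng = random.Random(5)
null_weights = [random_dirichlet_weights(3, rng) for _ in range(1000)]
print({k: round(v, 3) for k, v in m3.items()})
```

Each added factor shifts weight further away from the asset with the highest divergence and heaviest tail (Z in this toy input), which is the intended effect of the layered penalties.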

3.8.2. Results

Table 6 summarizes the performance of the three models against the bootstrap null distribution.
The results provide strong economic evidence for the value of GoF-based diagnostics:
  • Progressive improvement: Each additional diagnostic layer improves the Sharpe ratio and the outperformance rate. Model 3 achieves a Sharpe ratio of 1.112, a 26.6% relative improvement over the mean Sharpe ratio of the 1000 randomly constructed portfolios (0.878), and outperforms 80.3% of those random allocations.
  • Tail-risk trade-off: The Expected Shortfall (CVaR) of Model 3 (−2.74%) is slightly more negative than that of Model 1 (−2.58%), indicating a marginal increase in tail risk. Nevertheless, the substantial improvement in Sharpe ratio (from 0.968 to 1.112) demonstrates a favorable risk–return trade-off, where higher returns are achieved with only a modest increase in downside tail exposure.
  • Rejection of pure diversification bias: Because the random portfolios already incorporate arbitrary diversification, the fact that all three ED-based models outperform the random mean (and a large fraction of random draws) demonstrates that the observed distributional differences are not merely artifacts of ETF diversification; they reflect economically meaningful structural information.
Figure 8 displays the kernel density of Sharpe ratios for the 1000 random portfolios (gray). Vertical lines indicate the Sharpe ratios of Model 1 (blue), Model 2 (orange), and Model 3 (red). Model 3 clearly lies in the right tail of the null distribution, providing visual evidence of economic significance. In summary, this bootstrapped backtest provides both a control for diversification bias and an economic proof-of-concept. The results confirm that integrating Goodness-of-Fit diagnostics—especially the composite score that includes tail-risk information—yields portfolios with superior risk-adjusted returns, albeit with a negligible increase in downside tail risk. This trade-off validates the practical relevance of our distributional framework for investors who prioritize overall risk–return efficiency.
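The performance measures used in this comparison can be written compactly. The sketch below assumes a zero risk-free rate, 252 trading days per year, and a historical (empirical) Expected Shortfall. These conventions and the function names are assumptions, since the section does not state them. The last lines recompute the relative Sharpe improvement from the two rounded figures reported above.

```python
import math
import statistics


def annualized_sharpe(daily_returns, periods=252):
    """Mean over sample standard deviation, scaled by sqrt(periods)."""
    mean = statistics.mean(daily_returns)
    return mean / statistics.stdev(daily_returns) * math.sqrt(periods)


def expected_shortfall(daily_returns, level=0.05):
    """Average of the worst `level` share of returns (historical CVaR)."""
    ordered = sorted(daily_returns)
    k = max(1, int(len(ordered) * level))
    return statistics.mean(ordered[:k])


def outperformance_rate(model_sharpe, random_sharpes):
    """Share of random-portfolio Sharpe ratios strictly below the model's."""
    return sum(s < model_sharpe for s in random_sharpes) / len(random_sharpes)


# Arithmetic check using the rounded Sharpe ratios quoted in the text.
relative_gain = 1.112 / 0.878 - 1.0
print(f"relative Sharpe improvement from rounded inputs: {relative_gain:.4f}")
```

The rounded inputs give a ratio of roughly 1.2665, in line with the reported 26.6% once rounding of the two Sharpe ratios is taken into account.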

4. Conclusions

This study has demonstrated the diagnostic value of Goodness-of-Fit tests in financial analysis, moving beyond traditional correlation-based approaches to provide a more comprehensive understanding of distributional relationships and tail asymmetries between assets. Methodologically, we introduced a stringent pre-processing layer based on ARMA-GARCH filtering to isolate IID innovations, which underpins the validity of the subsequent non-parametric GoF tests. By employing a suite of such tests (Kolmogorov–Smirnov, Anderson–Darling, Cramér–von Mises, and Epps–Singleton) alongside a dual Energy Distance decomposition that separates total distance from shape-driven divergence, we have quantified the distributional structure and tail-risk profiles of non-Gaussian financial returns.
Our empirical findings in Stage 1 lead to three fundamental conclusions. First, volatility masks structure: TSLA exhibits near-perfect distributional symmetry with the DOW (KS p = 0.891, AD p = 0.867) once heteroskedasticity is removed, demonstrating that high volatility does not imply distributional divergence. Second, defensive assets can hide extreme tail risk: WMT, despite its low-volatility profile, shows the most significant departure from growth indices (ES p = 0.006 ***), proving that linear correlations and traditional risk metrics are insufficient for capturing tail behavior. Third, correlation is not a proxy for symmetry: AAPL correlates strongly with NASDAQ (0.779) yet fails the ES test, while TSLA correlates weakly with DOW (0.431) yet maintains high structural alignment. Together, these findings demonstrate that distributional symmetry constitutes a distinct, economically meaningful risk dimension that standard linear frameworks overlook.
In Stage 2, sectoral analysis reveals three key insights. First, sectoral coherence is heterogeneous: Energy (XOM/XLE) and Financials (JPM/XLF) exhibit strong internal alignment, while Consumer Staples shows a dramatic breakdown. WMT diverges sharply from XLP (ES p < 0.001 ***, ED_Shape = 0.0826), showing that a defensive asset can hide extreme tail risk. Second, dual Energy Distance distinguishes two divergence types: the TSLA deviation is volatility driven (ED_Total > ED_Shape), whereas that of WMT is shape driven (ED_Shape > ED_Total). Third, technology, consumer discretionary, and healthcare stocks (AAPL, TSLA, and JNJ) correlate strongly with their sectors yet fail the Epps–Singleton test (p < 0.05), revealing hidden higher-moment differences. Thus, traditional sector classifications miss critical tail risk, and dual ED provides a powerful tool to detect structural misalignment.
Extending the analysis to tail dynamics (Stage 3), we applied Extreme Value Theory (EVT) to compute Hill tail indices at the 1% and 5% thresholds. The left-tail indices of TSLA and DOW are remarkably close at both levels (5%: 3.33 vs. 3.64; 1%: 4.30 vs. 4.53), resolving the apparent paradox of high distributional similarity despite different sector affiliations. L-moments provide robust measures of tail heaviness, revealing that the WMT extreme kurtosis is not an outlier artifact but a genuine structural feature (L-kurtosis = 0.2184).
In Stage 4, we perform hierarchical agglomerative clustering using ED_Shape and L-kurtosis. The algorithm naturally separates the six stocks into three economically meaningful risk profiles: a low-risk “Safe Haven” (XOM), a moderate “Gray Zone” (AAPL, JPM, JNJ, and TSLA), and a high-risk “Toxic” cluster (WMT). This clustering demonstrates that GoF diagnostics can move beyond pairwise testing to construct actionable asset taxonomies.
Finally, Stage 5 provides an economic proof-of-concept through a bootstrapped portfolio backtest. Based on 1000 random portfolios, the average Sharpe ratio is 0.878. Our ED-based composite score (Model 3, which integrates ED_Shape, KS p-values, and an L-kurtosis penalty) achieves a Sharpe ratio of 1.112, outperforming 80.3% of the random portfolios and improving on the random-portfolio average by 26.6%. Relative to Model 1, the Expected Shortfall (CVaR) becomes marginally more negative (from −2.58% to −2.74%), reflecting a slight rise in tail risk, so the framework achieves a better overall risk–return balance at a small cost in downside exposure.
The practical implications of this research are two-fold. First, for portfolio construction, our mathematical framework provides a robust tool to identify true diversifiers—assets that not only move differently but exhibit structural asymmetry in their response to market shocks. The bootstrapped backtest validates that integrating GoF metrics with tail-risk penalties yields portfolios with higher Sharpe ratios with only a marginal increase in downside tail risk. Second, for risk management, the visual–numerical concordance established through KDE overlaps, EVT tail indices, and GoF diagnostics offers a more granular understanding of asymmetric tail-risk exposure, enabling more effective hedging strategies using sector ETFs under non-normal market conditions.
While this study focused on benchmark and sectoral symmetry analysis, future research could extend this framework to additional asset classes, regime-switching models, and dynamic portfolio optimization that explicitly incorporates time-varying GoF measures. Potential directions include examining stochastic symmetry during different market regimes, applying similar methodologies to cryptocurrencies or fixed income, and developing optimization frameworks that directly minimize Energy Distance or maximize tail-similarity scores.
In conclusion, this research contributes to the mathematical finance and econometrics literature by demonstrating how non-parametric Goodness-of-Fit tests, traditionally used merely for statistical validation, can serve as powerful diagnostic mechanisms for understanding complex, asymmetric financial relationships. The integration of ARMA-GARCH filtering, dual Energy Distance, EVT tail indices, L-moments, hierarchical clustering, and bootstrapped backtesting creates a unified, rigorous framework that connects distributional theory to economic value. By moving beyond first and second moments to examine entire distributional profiles, GoF methodologies offer deeper insights into stochastic asset behavior, asymmetric risk transmission mechanisms, and the validity of conventional financial classifications—providing valuable mathematical tools for both researchers and practitioners in increasingly complex, non-Gaussian financial markets.

Author Contributions

Conceptualization, A.S. and A.A.B.; methodology, A.S. and A.A.B.; software, A.S.; validation, A.S. and A.A.B.; formal analysis, A.S.; investigation, A.S. and A.A.B.; resources, A.S. and A.A.B.; data curation, A.S.; writing—original draft preparation, A.S. and A.A.B.; writing—review and editing, A.S. and A.A.B.; visualization, A.S.; supervision, A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

In alignment with the principles of open science and to ensure the reproducibility of our findings, the complete computational framework developed for this study has been made publicly available. This includes the Python implementations for ARMA-GARCH filtering, non-parametric Goodness-of-Fit (GoF) testing suite, Hierarchical Risk Clustering, and the bootstrapped portfolio backtesting engine. Researchers and practitioners can access the source code, along with the diagnostic tools and data processing scripts, at the following GitHub repository: https://github.com/asevin85/GoF-Framework-Symmetry (accessed on 22 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. González-Manteiga, W.; Crujeiras, R.M. An updated review of goodness-of-fit tests for regression models. Test 2013, 22, 361–411. [Google Scholar] [CrossRef] [PubMed]
  2. D’Agostino, R.B.; Stephens, M.A. Goodness-of-Fit Techniques. In Statistics: Textbooks and Monographs; Marcel Dekker: New York, NY, USA, 1986; Volume 68. [Google Scholar]
  3. Joe, H. Multivariate Models and Multivariate Dependence Concepts; CRC Press: Boca Raton, FL, USA, 1997. [Google Scholar]
  4. Nelsen, R.B. An Introduction to Copulas; Springer Series in Statistics; Springer: New York, NY, USA, 2006. [Google Scholar]
  5. Aas, K.; Czado, C.; Frigessi, A.; Bakken, H. Pair-copula constructions of multiple dependence. Insur. Math. Econ. 2009, 44, 182–198. [Google Scholar] [CrossRef]
  6. Patton, A. Copula methods for forecasting multivariate time series. Handb. Econ. Forecast. 2013, 2, 899–960. [Google Scholar]
  7. Vasileiou, E. Inaccurate value at risk estimations: Bad modeling or inappropriate data? Comput. Econ. 2022, 59, 1155–1171. [Google Scholar] [CrossRef]
  8. Wang, C.; Gerlach, R.; Chen, Q. A semi-parametric conditional autoregressive joint value-at-risk and expected shortfall modeling framework incorporating realized measures. Quant. Financ. 2023, 23, 309–334. [Google Scholar] [CrossRef]
  9. Cont, R. Empirical properties of asset returns: Stylized facts and statistical issues. Quant. Financ. 2001, 1, 223. [Google Scholar] [CrossRef]
  10. Bianchia, M.L.; Del Vecchioa, L.; Starab, F.M. Are parametric models still useful to measure the market risk of bank securities holdings? Borsa Istanb. Rev. 2025, 25, 1663–1681. [Google Scholar] [CrossRef]
  11. Tassinari, G.L.; Bianchi, M.L.; Fabozzi, F.J. Measuring Market Risk in Asset Management. J. Portf. Manag. 2024, 51, 28.
  12. Pearson, K. On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that it Can Be Reasonably Supposed to Have Arisen from Random Sampling. Philos. Mag. 1900, 50, 157–175.
  13. Moore, D.S. Tests of Chi-Squared Type. In Goodness-of-Fit Techniques; D’Agostino, R.B., Stephens, M.A., Eds.; Marcel Dekker: New York, NY, USA, 1986.
  14. Cirrone, G.A.P.; Donadio, S.; Guatelli, S.; Mantero, A.; Mascialino, B.; Parlati, S.; Pia, M.G.; Pfeiffer, A.; Ribon, A.; Viarengo, P. A Goodness-of-Fit Statistical Toolkit. IEEE Trans. Nucl. Sci. 2004, 51, 2056–2064.
  15. Stephens, M.A. EDF Statistics for Goodness of Fit and Some Comparisons. J. Am. Stat. Assoc. 1974, 69, 730–737.
  16. Kolmogorov, A.N. Sulla determinazione empirica di una legge di distribuzione. G. Dell’Istituto Ital. Degli Attuari 1933, 4, 83–91.
  17. Smirnov, N.V. Table for Estimating the Goodness of Fit of Empirical Distributions. Ann. Math. Stat. 1948, 19, 279–281.
  18. Cramér, H. On the Composition of Elementary Errors. Scand. Actuar. J. 1928, 1, 13–74.
  19. von Mises, R. Wahrscheinlichkeit, Statistik und Wahrheit; Springer: New York, NY, USA, 1936.
  20. Anderson, T.W.; Darling, D.A. Asymptotic Theory of Certain Goodness-of-Fit Criteria. Ann. Math. Stat. 1952, 23, 193–212.
  21. Jarque, C.M.; Bera, A.K. Efficient Tests for Normality, Homoscedasticity and Serial Independence of Regression Residuals. Econ. Lett. 1980, 6, 255–259.
  22. D’Agostino, R.B.; Pearson, E.S. Tests for Departure from Normality. Biometrika 1973, 60, 613–622.
  23. Epps, T.W.; Singleton, K.J. An omnibus test for the two-sample problem using the empirical characteristic function. J. Stat. Comput. Simul. 1986, 26, 177–203.
  24. Filliben, J.J. The Probability Plot Correlation Coefficient Test for Normality. Technometrics 1975, 17, 111–117.
  25. Shapiro, S.S.; Wilk, M.B. An Analysis of Variance Test for Normality (Complete Samples). Biometrika 1965, 52, 591–611.
  26. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
  27. Székely, G.J.; Rizzo, M.L. Testing for Equal Distributions in High Dimension. InterStat 2004, 5, 1249–1272.
  28. Campbell, J.Y.; Lo, A.W.; MacKinlay, A.C.; Whitelaw, R.F. The econometrics of financial markets. Macroecon. Dyn. 1998, 2, 559–562.
  29. Tsay, R.S. Analysis of Financial Time Series; John Wiley & Sons: Hoboken, NJ, USA, 2010.
  30. Ljung, G.M.; Box, G.E. On a measure of lack of fit in time series models. Biometrika 1978, 65, 297–303.
  31. Ljung, L. Convergence analysis of parametric identification methods. IEEE Trans. Autom. Control 2003, 23, 770–783.
  32. Scholz, F.W.; Stephens, M.A. K-sample Anderson–Darling tests. J. Am. Stat. Assoc. 1987, 82, 918–924.
  33. Good, P. Permutation, Parametric and Bootstrap Tests of Hypotheses; Springer: New York, NY, USA, 2005.
  34. Babu, G.J.; Rao, C.R. Goodness-of-fit tests when parameters are estimated. Sankhyā Indian J. Stat. 2004, 66, 63–74.
Figure 1. Cumulative log returns (stocks vs. benchmarks).
Figure 2. Goodness-of-Fit p-value distribution (IID innovations).
Figure 3. Pearson correlation matrix.
Figure 4. The 252-day rolling annualized volatility.
Figure 5. Sectoral Goodness-of-Fit heatmap (IID innovations).
Figure 6. Return distribution overlap (standardized IID innovations).
Figure 7. Hierarchical clustering based on ED_Shape and L-kurtosis.
Figure 8. Density of Sharpe ratios for 1000 bootstrapped random portfolios.
Table 1. Overview of Empirical Distribution Function (EDF) Goodness-of-Fit tests.
Test | Statistic Type | Sensitivity | Typical Purpose
Kolmogorov–Smirnov (KS) | Maximum distance | Central deviations | General distribution comparison
Cramér–von Mises (CvM) | Integrated squared distance | Global deviations | Overall Goodness-of-Fit
Anderson–Darling (AD) | Weighted integrated distance | Tail deviations | Extreme-value analysis
Table 2. Overview of moment-based Goodness-of-Fit tests.
Test | Moment(s) Used | Main Sensitivity | Typical Purpose
Jarque–Bera (JB) | Skewness and kurtosis | Global asymmetry and excess kurtosis | Normality testing in large samples, especially in econometrics and finance
D’Agostino–Pearson Omnibus | Transformed skewness and kurtosis | Joint shape and tail deviations | Balanced assessment of normality with improved finite-sample performance
Table 3. Stage 1: statistical test results for benchmark symmetry (IID innovations).
Stock | Index | Corr_R | Ann_Vol | LB_p | KS_p | AD_p | CvM_p | ES_p | Normality_p
AAPL | S&P500 | 0.762 | 0.275 | 0.915 | 0.778 | 0.457 | 0.624 | 0.284 | <0.001 ***
AAPL | NASDAQ | 0.779 | 0.275 | 0.915 | 0.612 | 0.346 | 0.542 | 0.036 ** | <0.001 ***
AAPL | DOW | 0.651 | 0.275 | 0.915 | 0.809 | 0.721 | 0.798 | 0.467 | <0.001 ***
TSLA | S&P500 | 0.569 | 0.601 | 0.148 | 0.778 | 0.442 | 0.619 | 0.273 | <0.001 ***
TSLA | NASDAQ | 0.637 | 0.601 | 0.148 | 0.713 | 0.412 | 0.667 | 0.034 ** | <0.001 ***
TSLA | DOW | 0.431 | 0.601 | 0.148 | 0.891 | 0.867 | 0.922 | 0.561 | <0.001 ***
JPM | S&P500 | 0.623 | 0.242 | 0.708 | 0.865 | 0.728 | 0.810 | 0.623 | <0.001 ***
JPM | NASDAQ | 0.473 | 0.242 | 0.708 | 0.746 | 0.568 | 0.692 | 0.171 | <0.001 ***
JPM | DOW | 0.714 | 0.242 | 0.708 | 0.809 | 0.641 | 0.773 | 0.573 | <0.001 ***
XOM | S&P500 | 0.339 | 0.268 | 0.550 | 0.317 | 0.145 | 0.225 | 0.144 | <0.001 ***
XOM | NASDAQ | 0.185 | 0.268 | 0.550 | 0.646 | 0.232 | 0.381 | 0.122 | <0.001 ***
XOM | DOW | 0.408 | 0.268 | 0.550 | 0.646 | 0.515 | 0.489 | 0.598 | <0.001 ***
JNJ | S&P500 | 0.214 | 0.167 | 0.809 | 0.646 | 0.237 | 0.341 | 0.461 | <0.001 ***
JNJ | NASDAQ | 0.081 | 0.167 | 0.809 | 0.341 | 0.225 | 0.382 | 0.100 * | <0.001 ***
JNJ | DOW | 0.350 | 0.167 | 0.809 | 0.713 | 0.523 | 0.539 | 0.574 | <0.001 ***
WMT | S&P500 | 0.367 | 0.210 | 0.609 | 0.231 | 0.102 | 0.219 | 0.058 * | <0.001 ***
WMT | NASDAQ | 0.298 | 0.210 | 0.609 | 0.164 | 0.093 * | 0.232 | 0.006 *** | <0.001 ***
WMT | DOW | 0.407 | 0.210 | 0.609 | 0.293 | 0.145 | 0.306 | 0.046 ** | <0.001 ***
Notes: *, **, *** denote significance at the 10%, 5%, and 1% levels, respectively. LB_p refers to Ljung–Box test on innovations.
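As a rough illustration of how two-sample p-values of the kind reported in Table 3 can be obtained and starred, the following sketch computes a two-sample Kolmogorov–Smirnov statistic and a permutation p-value using only the Python standard library. It is a sketch under stated assumptions, not the authors' implementation: the abstract describes a permutation scheme for the Anderson–Darling test, and KS is swapped in here only because its statistic is shorter to write. The function names `ks_statistic`, `permutation_p_value` and `stars`, the seed, and the number of permutations are all hypothetical.

```python
import random


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Advance both samples past the current value (handles ties).
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Permutation p-value: share of label shuffles at least as extreme."""
    rng = random.Random(seed)  # assumption: fixed seed for reproducibility
    observed = ks_statistic(x, y)
    pooled = list(x) + list(y)
    n = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n], pooled[n:]) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement as one permutation.
    return (hits + 1) / (n_perm + 1)


def stars(p):
    """Significance marker following the table notes (10%, 5%, 1%)."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""
```

Because the table prints p-values rounded to three decimals, a printed value such as 0.100 can still carry a star if the unrounded p-value lies just below the threshold; `stars` should therefore be applied to unrounded values.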
Table 4. Sectoral distributional similarity analysis (dual energy distance & IID innovations).
Stock | Sector | Corr | ED_Total | ED_Shape | LB_p | KS_p | CvM_p | ES_p | JB_p
AAPL | XLK | 0.776 | 0.0054 | 0.0488 | 0.911 | 0.272 | 0.230 | 0.020 ** | <0.001 ***
TSLA | XLY | 0.769 | 0.0641 | 0.0509 | 0.143 | 0.150 | 0.174 | 0.016 ** | <0.001 ***
JPM | XLF | 0.857 | 0.0128 | 0.0385 | 0.710 | 0.646 | 0.521 | 0.194 | <0.001 ***
XOM | XLE | 0.932 | 0.0040 | 0.0280 | 0.556 | 0.646 | 0.696 | 0.883 | <0.001 ***
JNJ | XLV | 0.590 | 0.0045 | 0.0538 | 0.807 | 0.076 * | 0.155 | 0.011 ** | <0.001 ***
WMT | XLP | 0.580 | 0.0164 | 0.0826 | 0.618 | 0.056 * | 0.029 ** | <0.001 *** | <0.001 ***
Notes: ED_Total = Energy Distance on raw innovations (scale + shape); ED_Shape = Energy Distance on standardized innovations (shape only); Corr = Pearson correlation; LB = Ljung–Box; KS = Kolmogorov–Smirnov; CvM = Cramér–von Mises; ES = Epps–Singleton; JB = Jarque–Bera. Significance: * p < 0.10, ** p < 0.05, *** p < 0.01.
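The distinction drawn in the notes between ED_Total (raw innovations) and ED_Shape (standardized innovations) can be sketched as follows. This is a minimal illustration under assumptions, not the estimator used in the paper: it uses the plug-in (V-statistic) form of the squared energy distance, 2E|X − Y| − E|X − X′| − E|Y − Y′|, and a plain z-score standardization with the sample standard deviation. The helper names `energy_distance` and `standardize` and the toy data are hypothetical.

```python
import statistics


def energy_distance(x, y):
    """Squared energy distance between two 1-D samples (plug-in form)."""
    def mean_abs_diff(a, b):
        # Average absolute difference over all pairs (a_i, b_j).
        return sum(abs(u - v) for u in a for v in b) / (len(a) * len(b))

    return 2 * mean_abs_diff(x, y) - mean_abs_diff(x, x) - mean_abs_diff(y, y)


def standardize(x):
    """Z-score a sample so that only its shape remains (assumed convention)."""
    mu = statistics.fmean(x)
    sd = statistics.stdev(x)
    return [(v - mu) / sd for v in x]


# Toy data: y is a shifted and rescaled copy of x, so the two samples
# differ in location and scale but have exactly the same shape.
x = [-1.2, -0.4, 0.0, 0.3, 0.9, 2.1]
y = [2 * v + 1 for v in x]

ed_total = energy_distance(x, y)                            # scale + shape
ed_shape = energy_distance(standardize(x), standardize(y))  # shape only
```

In this toy case `ed_total` is positive while `ed_shape` is zero up to floating-point error, which is the kind of separation between scale-driven and shape-driven deviation that the two columns are meant to capture. Note that the two quantities are measured in different units (raw returns versus standard deviations), so their magnitudes are not directly comparable.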
Table 5. Extreme risk profiling: L-moments and tail dynamics (IID innovations).
Asset | Z-Min | Std_Skew | L-Skew | L-Kurt | Tail_L_5% | Tail_R_5% | Tail_L_1% | Tail_R_1% | Tail_Asym
AAPL | −5.7476 | −0.0344 | −0.0124 | 0.1958 | 3.3995 | 2.7337 | 4.2510 | 2.9259 | 1.2435
TSLA | −4.9953 | 0.0478 | −0.0101 | 0.1908 | 3.3274 | 3.3445 | 4.3041 | 2.7165 | 0.9949
JPM | −5.6532 | −0.0045 | −0.0416 | 0.1956 | 2.9485 | 3.1460 | 3.4301 | 3.1271 | 0.9372
XOM | −4.5039 | −0.2784 | −0.0288 | 0.1544 | 3.5960 | 4.3308 | 6.4595 | 8.9478 | 0.8303
WMT | −10.2465 | −0.9708 | −0.0303 | 0.2184 | 2.8869 | 2.6241 | 2.0683 | 2.0786 | 1.1002
JNJ | −7.9514 | 0.0141 | −0.0061 | 0.2017 | 3.3466 | 2.9339 | 4.6025 | 2.4773 | 1.1407
XLK | −4.5571 | −0.3698 | −0.0600 | 0.1591 | 3.8047 | 4.7622 | 5.5030 | 4.0635 | 0.7989
XLY | −4.2532 | −0.3006 | −0.0522 | 0.1519 | 4.1017 | 4.2869 | 4.9157 | 4.4484 | 0.9568
XLF | −4.8286 | −0.2080 | −0.0431 | 0.1675 | 3.5058 | 3.7018 | 4.8309 | 4.6544 | 0.9471
XLE | −6.6094 | −0.5489 | −0.0544 | 0.1580 | 3.7516 | 5.8988 | 4.0924 | 9.9389 | 0.6360
XLP | −6.7821 | −0.4463 | −0.0360 | 0.1569 | 3.5479 | 4.2029 | 3.5619 | 4.0544 | 0.8442
XLV | −6.5587 | −0.2864 | −0.0227 | 0.1545 | 4.3613 | 3.6278 | 3.6972 | 8.5069 | 1.2022
SPY | −5.0765 | −0.5955 | −0.0682 | 0.1715 | 3.4244 | 4.4513 | 3.4955 | 5.3512 | 0.7693
QQQ | −4.8335 | −0.4171 | −0.0652 | 0.1641 | 3.8416 | 4.2480 | 5.9414 | 4.4989 | 0.9043
DIA | −4.6095 | −0.3155 | −0.0402 | 0.1723 | 3.6366 | 4.3884 | 4.5263 | 5.3902 | 0.8287
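For readers unfamiliar with the L-Skew and L-Kurt columns, the sketch below shows one standard way to compute sample L-moment ratios, using unbiased probability-weighted moments in the manner of Hosking. It is offered as an assumption about how such columns are typically produced; the estimator actually used for Table 5 may differ. The function name `l_moment_ratios` and the toy sample are hypothetical.

```python
def l_moment_ratios(data):
    """Return (L-skewness, L-kurtosis) from sample L-moments.

    Uses unbiased probability-weighted moments b_0..b_3 of the ordered
    sample, then the usual linear combinations for l_2, l_3, l_4.
    """
    x = sorted(data)
    n = len(x)
    if n < 4:
        raise ValueError("at least four observations are required")

    b = [0.0, 0.0, 0.0, 0.0]
    for i, v in enumerate(x):  # i is the 0-based rank of v
        w = 1.0
        for r in range(4):
            if r > 0:
                # Builds i(i-1)...(i-r+1) / ((n-1)(n-2)...(n-r)) step by step.
                w *= (i - (r - 1)) / (n - r)
            b[r] += w * v
    b = [s / n for s in b]

    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]
    return l3 / l2, l4 / l2


# Toy symmetric sample: its L-skewness should vanish.
l_skew, l_kurt = l_moment_ratios([-3.0, -1.0, 0.0, 1.0, 3.0])
```

Because L-moments are linear in the order statistics, a single extreme observation (such as the Z-Min values in the table) moves them far less than it moves conventional skewness and kurtosis, which is why they are described as robust.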
Table 6. Bootstrapped portfolio performance (1000 random portfolios).
Model | Sharpe Ratio | Outperformance Rate | Expected Shortfall (CVaR)
Random (mean) | 0.878 | |
Model 1 (ED only) | 0.968 | 58.5% | −2.58%
Model 2 (ED + KS) | 1.103 | 79.2% | −2.70%
Model 3 (ED + KS + L-kurtosis) | 1.112 | 80.3% | −2.74%
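The three columns of Table 6 can be illustrated with a short sketch. It assumes that the outperformance rate is the share of random portfolios whose Sharpe ratio falls below the model's, which matches the abstract's phrase "outperforming 80.3% of random allocations", and it further assumes a zero risk-free rate, 252 trading days per year, and a 5% level for Expected Shortfall. None of these settings is confirmed by the table itself, and the function names and simulated returns are hypothetical.

```python
import math
import random
import statistics


def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    mu = statistics.fmean(daily_returns)
    sd = statistics.stdev(daily_returns)
    return mu / sd * math.sqrt(periods_per_year)


def outperformance_rate(model_sharpe, random_sharpes):
    """Fraction of random portfolios with a lower Sharpe ratio."""
    return sum(s < model_sharpe for s in random_sharpes) / len(random_sharpes)


def expected_shortfall(returns, alpha=0.05):
    """Mean of the worst alpha share of returns (historical CVaR)."""
    k = max(1, math.floor(alpha * len(returns)))
    return statistics.fmean(sorted(returns)[:k])


# Synthetic stand-in for the 1000 random portfolios (not the paper's data).
rng = random.Random(0)
random_sharpes = [
    sharpe_ratio([rng.gauss(0.0005, 0.01) for _ in range(252)])
    for _ in range(1000)
]
rate = outperformance_rate(1.0, random_sharpes)
```

Read this way, the 80.3% reported for Model 3 places it in roughly the top fifth of the random allocations, while its slightly more negative CVaR relative to Model 1 shows that the Sharpe improvement does not come with a smaller tail loss.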
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sevin, A.; Bah, A.A. A Goodness-of-Fit Framework for Assessing Distributional Symmetry and Tail Asymmetry in Financial Equity Markets. Symmetry 2026, 18, 943. https://doi.org/10.3390/sym18060943


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.