Diffusion-Based Parameters for Stock Clustering: Sector Separation and Out-of-Sample Evidence

Promsuwan, Piyarat; Khanarsa, Paisit; Chumpong, Kittisak

doi:10.3390/jrfm18110637


by Piyarat Promsuwan ¹, Paisit Khanarsa ²,* and Kittisak Chumpong ¹,³,⁴,*

¹ Division of Computational Science, Faculty of Science, Prince of Songkla University, Songkhla 90110, Thailand

² Institute of Field Robotics, King Mongkut’s University of Technology Thonburi, Bangkok 10140, Thailand

³ Research Center in Mathematics and Statistics with Applications, Prince of Songkla University, Songkhla 90110, Thailand

⁴ Financial Mathematics, Data Science and Computational Innovations Research Unit (FDC), Department of Mathematics, Faculty of Science, Kasetsart University, Chatuchak, Bangkok 10900, Thailand

* Authors to whom correspondence should be addressed.

J. Risk Financial Manag. 2025, 18(11), 637; https://doi.org/10.3390/jrfm18110637

Submission received: 7 October 2025 / Revised: 7 November 2025 / Accepted: 8 November 2025 / Published: 12 November 2025

(This article belongs to the Special Issue Machine Learning-Based Risk Management in Finance and Insurance)


Abstract

Clustering techniques are widely applied to equity markets to uncover sectoral structures and regime shifts, yet most studies rely solely on empirical returns. This paper introduces a novel perspective by using diffusion-based parameters from the Black–Scholes model, namely monthly drift and diffusion, as clustering features. Using SET100 stocks in 2020, we applied k-means clustering and evaluated performances with silhouette scores, the Adjusted Rand Index, Wilcoxon tests, and an out-of-sample portfolio exercise. The results showed that diffusion-based features achieved higher silhouette scores in turbulent months, where they revealed sectoral divergence that log-returns failed to capture. The partition for November 2020 provided clearer sector separation and smaller portfolio losses, demonstrating predictive value beyond in-sample fit. Practically, the findings indicate that diffusion-based parameters can signal early signs of market stress, guide sector rotation decisions during volatile regimes, and enhance portfolio risk management by isolating persistent volatility structures across sectors. Theoretically, this model-based framework bridges equity clustering with stochastic diffusion representations used in derivatives valuation, offering a unified and interpretable tool for data-driven market monitoring.

Keywords: stock clustering; diffusion process; Black–Scholes model; log-return; silhouette coefficient; adjusted rand index; SET100 index

1. Introduction

Applying clustering techniques to stock groups or market indices can uncover underlying structures and interconnections within financial markets. Clustering has been used across diverse datasets, from global indices to individual stock exchanges, and has proven effective at detecting market regimes, sectoral relationships, and crisis dynamics. Methods such as k-means clustering, Gaussian mixture models, spectral clustering, and correlation-based approaches are commonly employed to analyze historical prices, returns, and market microstructure features. The insights obtained not only enhance the understanding of market dynamics but also support practical applications in portfolio diversification, index construction, and risk management (Aslam et al., 2023; Nagy & Ormos, 2018).

Empirical research on stock clustering has developed over several decades. Osborne (1962) was among the first to document clustering in closing prices on the New York Stock Exchange, and Niederhoffer (1965) later showed that such clustering varied by stock type. Subsequent studies provided evidence of persistent clustering across different markets, including the Australian Stock Exchange (Aitken et al., 1996), the London Stock Exchange (Grossman et al., 1997), the Singapore Exchange (Hameed & Terry, 1998), the Kuala Lumpur Stock Exchange (Chung et al., 2005), the Tokyo Stock Exchange (Aşçıoğlu et al., 2007), and the Chinese markets in Shanghai and Shenzhen (Brown & Mitchell, 2008). However, most of these studies focused on price or return data. The diffusion processes underlying stock price dynamics were only addressed in a few works, such as those of Hirukawa (2006) and De Gregorio and Iacus (2010).

A widely adopted feature of investment performance analysis is the log-return (LR), defined as the logarithmic difference of consecutive stock prices. LR clustering has been extensively used to detect patterns, anomalies, and common behaviors among assets, and has been combined with k-means variants, subspaces, and temporal clustering to enhance the analysis (Gan & Chen, 2016; Gramuglia et al., 2021). Despite the popularity of LR clustering as an investment analysis technique, it struggles with irregular temporal structures and often fails to capture deeper market dynamics due to its sensitivity to noise and outliers.

Alternatively, stock price movements can be modeled using Geometric Brownian Motion (GBM), a cornerstone of quantitative financial analysis. GBM assumes that the logarithm of stock prices follows a Brownian motion with drift, producing continuous paths characterized by two parameters: drift (expected return) and volatility (random fluctuations). It ensures positive prices, captures compounded returns, and underlies the Black–Scholes (BS) model. While more sophisticated models such as stochastic volatility or jump-diffusion processes have been proposed, classical GBM remains widely used due to its simplicity and analytical tractability (Chai, 2019).

Although stock clustering has been extensively examined in the literature, most existing studies continue to rely primarily on returns-based features. Analytical frameworks grounded in stochastic differential equations (SDEs), which have long provided closed-form foundations for modeling asset dynamics and option valuation (Cox et al., 1985; Duffie et al., 2003; Heston, 1993; Merton, 1973), have also been extended in recent research to regime-switching, trending volatility, and generalized diffusion settings (Chumpong et al., 2022, 2024; Duangpan et al., 2022; Rujivan, 2025; Rujivan et al., 2023, 2025; Sutthimat & Mekchay, 2022). This perspective highlights the relevance of parameter estimation beyond realized returns and motivates our use of BS-based features to capture the underlying risk–return structure of stocks.

In particular, the drift and diffusion parameters from the BS framework offer a theoretically grounded representation of stock dynamics, yet have rarely been employed in clustering analysis. This study makes three key contributions. First, it shifts stock clustering from descriptive statistics (LR means and standard deviation) to structural parameters (drift and diffusion) that govern the stochastic process of prices, embedding clustering in the underlying dynamics rather than surface-level summaries. Second, by showing that diffusion-based features sharpen sector separation during market stress (e.g., COVID-19 shocks), the study demonstrates their value as regime-detection tools. Third, because the structural parameters are directly applied to derivatives pricing, the framework creates a novel bridge between equity clustering and derivatives valuation. Together, these aspects elevate diffusion-based clustering beyond a simple feature replacement to a new paradigm that integrates market structure analysis, stress detection, and the pricing of derivatives.

The remainder of this article is structured as follows. Section 2 details the data and methodology, including lognormality testing, feature extraction, and clustering evaluation. Section 3 reports the empirical findings, while Section 4 provides interpretation and discusses practical implications. Section 5 concludes with a summary of the study and suggestions for future research.

2. Model and Methodology

This section outlines the methodological framework of the study. All analyses were implemented in Python (version 3.13) using standard scientific libraries such as NumPy, pandas, SciPy, and scikit-learn. We begin by testing the lognormality of stock prices to ensure the appropriateness of using diffusion-based parameters. Next, we describe the feature extraction process, including the derivation of LRs and the estimation of drift and diffusion parameters from the BS model. These features are then standardized and used as inputs to k-means clustering. Finally, clustering performance is evaluated using internal validity indices and statistical tests.

2.1. Model

Black and Scholes (1973) proposed their widely used model of stock price dynamics. Because real markets are too complex to be modelled in full, the model rests on simplifying assumptions. The price of stock $i$, denoted $S_{t}^{(i)}$ for $i = 1, 2, \dots, N$, is assumed to follow the SDE

$$d S_{t}^{(i)} = r_{i} S_{t}^{(i)} \, d t + \sigma_{i} S_{t}^{(i)} \, d W_{t}^{(i)}, \tag{1}$$

where $r_{i}$ is the drift rate (the risk-free interest rate in the original BS setting), $\sigma_{i}$ is the volatility of the stock price, and $W_{t}^{(i)}$ is a Brownian motion on the probability space $(\Omega, \mathcal{F}, P)$, with $W_{t}^{(i)}$ and $W_{t}^{(j)}$ assumed to be independent for $i \neq j$. In the resulting continuous-time model, the GBM of the stock price is given by

$$S_{t}^{(i)} = S_{0}^{(i)} \exp \left( \mu_{i} t + \sigma_{i} W_{t}^{(i)} \right), \qquad \mu_{i} = r_{i} - \frac{1}{2} \sigma_{i}^{2},$$

where $S_{0}^{(i)}$ is the stock price at time 0. This means that $S_{t}^{(i)}$ has a lognormal distribution (Hull & Basu, 2016), and in this research we treat all stock prices as lognormally distributed. We then estimate the corresponding parameters from historical data of the stock market in Thailand and group the stocks by k-means clustering.

2.2. Testing of Price Distribution from SET100

We collected secondary data on closing prices of the constituents of the SET100 index from 2 January to 30 December 2020. The data comprises the 100 biggest companies in terms of market capitalisation as calculated by the SET100 index.

In general, two approaches can be taken to examine the lognormal distribution of stock prices: informal graphical methods and formal statistical hypothesis tests. While graphical tools such as histograms can provide intuition, they are insufficient for rigorous analysis. Therefore, this study adopted the Anderson–Darling (AD) test, which has been shown to be particularly effective for testing lognormality (Anderson & Darling, 1952; Aşçıoğlu et al., 2007; Tolikas & Heravi, 2008; Ul-Islam, 2011). To improve accuracy in small samples, we used the correction formulas for p-values proposed by D’Agostino and Stephens (1986). For each SET100 stock and each month of 2020, the null hypothesis of lognormality was tested at the 5% significance level. Stocks that passed the test were retained for subsequent parameter estimation and clustering analysis. The AD tests were performed using the scipy.stats.anderson function from the SciPy library.
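The screening step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the test is applied to the logarithm of the input series with `dist="norm"` (lognormality of a series is equivalent to normality of its logs), and it assumes the commonly quoted piecewise p-value approximation attributed to D’Agostino and Stephens (1986). The function name `ad_lognormal_pvalue` and the constants' exact use in the study are assumptions.

```python
import numpy as np
from scipy import stats


def ad_lognormal_pvalue(values):
    """Approximate Anderson-Darling p-value for H0: `values` are lognormal.

    Assumption: lognormality is tested as normality of log(values), with
    mean and variance estimated from the sample.
    """
    logs = np.log(np.asarray(values, dtype=float))
    n = logs.size
    a2 = stats.anderson(logs, dist="norm").statistic
    # Small-sample adjustment of the statistic (estimated mean and variance).
    a2_star = a2 * (1.0 + 0.75 / n + 2.25 / n**2)
    # Piecewise p-value approximation (assumed constants, see lead-in).
    if a2_star >= 0.6:
        p = np.exp(1.2937 - 5.709 * a2_star + 0.0186 * a2_star**2)
    elif a2_star > 0.34:
        p = np.exp(0.9177 - 4.279 * a2_star - 1.38 * a2_star**2)
    elif a2_star > 0.2:
        p = 1.0 - np.exp(-8.318 + 42.796 * a2_star - 59.938 * a2_star**2)
    else:
        p = 1.0 - np.exp(-13.436 + 101.14 * a2_star - 223.73 * a2_star**2)
    return float(min(max(p, 0.0), 1.0))


# Illustrative inputs only: one near-lognormal series and one bimodal series.
near_lognormal = np.exp(stats.norm.ppf((np.arange(1, 21) - 0.5) / 20))
bimodal = np.exp(np.concatenate([np.linspace(0, 0.1, 10),
                                 np.linspace(10, 10.1, 10)]))
keep_first = ad_lognormal_pvalue(near_lognormal) >= 0.05   # retained
keep_second = ad_lognormal_pvalue(bimodal) >= 0.05         # screened out
```

A stock-month would be retained when the returned p-value is at least 0.05, mirroring the 5% screening rule described above.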

This screening step, utilizing the AD test, ensures that the stock prices used for parameter estimation are consistent with the lognormal assumption fundamental to the BS framework. Although this procedure reduces the sample size in certain months, it preserves the theoretical integrity required for subsequent model-based estimation. This unavoidable trade-off between data coverage and theoretical validity was considered acceptable to ensure the reliability of the maximum-likelihood estimation (MLE) of the BS parameters under the GBM assumption.

2.3. Parameter Estimation

For clustering, we extract two sets of features: drift and diffusion from the BS model, and mean and standard deviation from empirical LRs. These complementary selections are detailed below.

2.3.1. Estimated Parameters from the Black–Scholes Model

There are many commonly used methods for estimating the parameters of a probability model. One of the most efficient is maximum likelihood estimation, which provides a consistent yet flexible approach suitable for a wide variety of applications. Furthermore, under standard regularity conditions, the resulting MLE is consistent, asymptotically normal, and asymptotically efficient, attaining the Cramér–Rao lower bound in large samples.

In the BS model, the dynamics of the stock price $S$ are governed by a constant drift $r$ and a constant volatility $\sigma$. To bring the model dynamics as close as possible to the empirical stock market data, we use the techniques of Aït-Sahalia (2002), Egorov et al. (2003), and Rujivan (2010). For $t > \tau$, the transition probability density function of the BS model is denoted by

$$p \left( S_{t}^{(i)}, t \mid S_{\tau}^{(i)}, \tau; (\sigma_{i}, r_{i}) \right) = \frac{1}{S_{t}^{(i)} \sqrt{2 \pi \sigma_{i}^{2} (t - \tau)}} \exp \left( - \frac{\left( \ln \left( \frac{S_{t}^{(i)}}{S_{\tau}^{(i)}} \right) - \left( r_{i} - \frac{1}{2} \sigma_{i}^{2} \right) (t - \tau) \right)^{2}}{2 \sigma_{i}^{2} (t - \tau)} \right).$$

Let $S_{t_{j}}^{(i)}$ be the closing price of stock $i$ observed at times $t_{j}$ such that $t_{j} = j \Delta t$, with equally spaced discrete observations, $\Delta t = t_{j} - t_{j - 1} > 0$, $j = 1, 2, \dots, N$, for some positive integer $N$. Applying Bayes’s rule to the Markov process $S_{t}^{(i)}$, we get the following log-likelihood function:

$$L_{N} ((\sigma_{i}, r_{i})) = \sum_{j = 1}^{N} \ln p \left( S_{t_{j}}^{(i)}, t_{j} \mid S_{t_{j - 1}}^{(i)}, t_{j - 1}; (\sigma_{i}, r_{i}) \right). \tag{2}$$

Maximizing $L_{N} ((\sigma_{i}, r_{i}))$ over the parameter space $\Theta = \{ (\sigma_{i}, r_{i}) \in \mathbb{R}^{+} \times \mathbb{R} \}$ for each stock $i$, one obtains an MLE $\theta_{N, i}^{MLE}$, which is a solution of the optimization problem

$$\theta_{N, i}^{MLE} = \arg \max_{(\sigma_{i}, r_{i}) \in \Theta} L_{N} ((\sigma_{i}, r_{i})).$$
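Because the BS transition density is Gaussian in the log returns, with mean $(r_{i} - \frac{1}{2} \sigma_{i}^{2}) \Delta t$ and variance $\sigma_{i}^{2} \Delta t$, the maximizer of the exact log-likelihood can be written in closed form. The sketch below illustrates this shortcut; it is offered only as a reference point and is not necessarily the routine the authors used. The function name `gbm_mle` and the price series are hypothetical.

```python
import numpy as np


def gbm_mle(prices, dt):
    """Closed-form MLE of (sigma, r) for GBM from equally spaced prices."""
    x = np.diff(np.log(np.asarray(prices, dtype=float)))  # log returns
    m = x.mean()                   # MLE of (r - sigma^2 / 2) * dt
    v = np.mean((x - m) ** 2)      # MLE of sigma^2 * dt (divides by N)
    sigma = np.sqrt(v / dt)
    r = m / dt + 0.5 * sigma**2    # invariance of the MLE
    return sigma, r


# Hypothetical month with 21 closing prices (20 log returns), dt = 1/20.
increments = 0.01 * np.sin(np.arange(1, 21))
prices = 100.0 * np.exp(np.cumsum(np.r_[0.0, increments]))
sigma_hat, r_hat = gbm_mle(prices, dt=1.0 / 20)
mu_hat = r_hat - 0.5 * sigma_hat**2   # monthly drift of the log price
```

With $\Delta t = 1/N$, the resulting $\hat{\mu}$ equals the total log return over the month, which gives a quick sanity check on any numerical optimizer.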

The log-likelihood function is approximated by $L_{N}^{(K, \Delta t)} ((\sigma_{i}, r_{i}))$ for some positive integer $K$. Replacing $L_{N} ((\sigma_{i}, r_{i}))$ in Equation (2) with $L_{N}^{(K, \Delta t)} ((\sigma_{i}, r_{i}))$ and solving the problem, we obtain an approximate MLE $\hat{\theta}_{N, i}^{MLE} = (\hat{\sigma}_{i}, \hat{r}_{i})$ that converges in probability when $K$ is large and $\Delta t$ is small. The estimation steps are summarized in Algorithm 1.

Algorithm 1 Monthly GBM Parameter Estimation with Iterative Simulation and Categorization

Require: Set of months $\mathcal{M}$; for each month $M \in \mathcal{M}$: list of stocks in SET100

1: Daily close prices $\{ S_{t_{0}}^{(i)}, \dots, S_{t_{N}}^{(i)} \}$ for each stock $i$ in month $M$, where $N$ is the number of trading days
2: Time step $\Delta t = 1 / N$; number of simulation rounds $B = 50{,}000$

Ensure: For each stock $i$ in each month $M$: estimated parameters $(\hat{\sigma}_{i}, \hat{\mu}_{i})$

3: for each month $M$ do
4:     for each stock $i$ in SET100 do
5:         Extract daily close prices $\{ S_{t_{0}}^{(i)}, \dots, S_{t_{N}}^{(i)} \}$ in month $M$
6:         Estimate initial GBM parameters $(\sigma_{current}, r_{current})$ from observed prices
7:         Create empty lists to store $\sigma$ and $r$ from simulations
8:         for $k \leftarrow 1$ to $B$ do
9:             Simulate a price path using $(\sigma_{current}, r_{current})$ with the GBM model
10:            Estimate new parameters $(\sigma_{new}, r_{new})$ from the simulated path
11:            Save $\sigma_{new}$ and $r_{new}$ into the lists
12:            Update $(\sigma_{current}, r_{current}) \leftarrow (\sigma_{new}, r_{new})$
13:        end for
14:        Compute $\hat{\sigma}_{i} \leftarrow$ average of all $\sigma_{new}$ values
15:        Compute $\hat{r}_{i} \leftarrow$ average of all $r_{new}$ values
16:        Set $\hat{\mu}_{i} \leftarrow \hat{r}_{i} - \frac{1}{2} \hat{\sigma}_{i}^{2}$
17:    end for
18:    Categorize stocks based on $\hat{\sigma}_{i}$ and $\hat{\mu}_{i}$ values
19:    Store month-$M$ results for further analysis
20: end for
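A minimal Python sketch of the inner loop of Algorithm 1, for one stock in one month, is given below. It assumes the closed-form Gaussian MLE as the estimator in the two "estimate" steps and uses far fewer rounds than $B = 50{,}000$ for illustration. The names `gbm_mle`, `simulate_gbm`, and `iterative_estimate`, and the `update_each_round` flag, are hypothetical and not taken from the authors' implementation. With the flag on, each round starts from the previous simulated estimate, as the update step of the algorithm reads; with short monthly series the estimates can then move away from the initial values as rounds accumulate, so the flag also allows a fixed-parameter bootstrap for comparison, which is an alternative reading rather than something the source states.

```python
import numpy as np


def gbm_mle(prices, dt):
    """Closed-form Gaussian MLE of (sigma, r) from a price path (assumed estimator)."""
    x = np.diff(np.log(np.asarray(prices, dtype=float)))
    m = x.mean()
    sigma = np.sqrt(np.mean((x - m) ** 2) / dt)
    return sigma, m / dt + 0.5 * sigma**2


def simulate_gbm(s0, sigma, r, n_steps, dt, rng):
    """Simulate one GBM path with n_steps increments, starting at s0."""
    z = rng.standard_normal(n_steps)
    log_inc = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.concatenate(([0.0], np.cumsum(log_inc))))


def iterative_estimate(prices, n_rounds=1000, update_each_round=True, seed=0):
    """Average simulated re-estimates of (sigma, r); return (sigma_hat, mu_hat)."""
    prices = np.asarray(prices, dtype=float)
    n_steps = len(prices) - 1
    dt = 1.0 / n_steps                      # time step 1/N, monthly units
    rng = np.random.default_rng(seed)
    sigma_cur, r_cur = gbm_mle(prices, dt)  # initial estimate from data
    sigmas, rs = [], []
    for _ in range(n_rounds):
        path = simulate_gbm(prices[0], sigma_cur, r_cur, n_steps, dt, rng)
        sigma_new, r_new = gbm_mle(path, dt)
        sigmas.append(sigma_new)
        rs.append(r_new)
        if update_each_round:               # literal reading of the update step
            sigma_cur, r_cur = sigma_new, r_new
    sigma_hat = float(np.mean(sigmas))
    r_hat = float(np.mean(rs))
    return sigma_hat, r_hat - 0.5 * sigma_hat**2


# Hypothetical month of 21 closing prices (illustrative numbers only).
prices = 50.0 * np.exp(np.cumsum(np.r_[0.0, 0.02 * np.cos(np.arange(1, 21))]))
sigma_fixed, mu_fixed = iterative_estimate(prices, n_rounds=300,
                                           update_each_round=False, seed=1)
sigma_chain, mu_chain = iterative_estimate(prices, n_rounds=50, seed=1)
```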

Although the BS parameters can be estimated approximately by maximum likelihood estimation, short monthly series may cause small-sample bias and instability. To improve the robustness of the method, we implemented an iterative simulation procedure (Algorithm 1) with B = 50,000 replications, which yields estimates close to those yielded by maximum likelihood estimation but more stable in practice.

The choice of the BS model, grounded in the GBM stochastic process, represents a deliberate methodological decision. As the simplest and most widely recognized continuous-time diffusion model in finance, the BS framework provides a tractable and theoretically consistent benchmark for parameter estimation. By linking drift and volatility to a structural stochastic process, it yields forward-looking measures of expected return and risk that serve as interpretable features for clustering analysis.

2.3.2. Estimated Parameters from Empirical Log Returns

The empirical LR approach provides a simple and model-free way to summarize the distributional characteristics of asset price changes over a given period. Unlike the BS estimation in the previous subsection, which relies on the continuous-time GBM assumption and maximizes a log-likelihood function, this method computes the parameters directly from observed price data without any diffusion model assumptions. It serves as a baseline for comparison with the BS parameter estimates.

Let $S_{t_{j}}^{(i)}$ be the observed daily closing price of stock $i$ on trading day $t_{j}$ in month $M$, where $j = 1, 2, \dots, N$ and $N$ is the number of trading days in the month. The continuously compounded (log) return between $t_{j - 1}$ and $t_{j}$ is given by

$$X_{j}^{(i)} = \ln \left( \frac{S_{t_{j}}^{(i)}}{S_{t_{j - 1}}^{(i)}} \right), \qquad j = 2, \dots, N.$$

Since there are $N$ observed prices, the total number of log returns is $N - 1$.

The empirical mean and standard deviation of the log returns are computed as

$$\bar{\mu}_{i} = \frac{1}{N - 1} \sum_{j = 2}^{N} X_{j}^{(i)}, \qquad \bar{\sigma}_{i} = \sqrt{\frac{1}{N - 2} \sum_{j = 2}^{N} \left( X_{j}^{(i)} - \bar{\mu}_{i} \right)^{2}}.$$

These two quantities provide a straightforward measure of the average return and volatility of stock $i$ in month $M$. They are directly interpretable and can be calculated for any asset with sufficient historical data. Algorithm 2 provides the empirical LR baseline for comparison with the model-based BS estimation in Algorithm 1. Although the computation of $\bar{\mu}$ and $\bar{\sigma}$ from historical LRs is a standard procedure, its explicit inclusion clarifies the methodological contrast between data-driven and diffusion-based parameter estimation and ensures reproducibility within the monthly clustering framework.

Algorithm 2 Monthly Empirical LR Parameters Calculation

Require: Set of months $\mathcal{M}$; for each month $M \in \mathcal{M}$: list of stocks in SET100

1: Daily close prices $\{ S_{t_{1}}^{(i)}, \dots, S_{t_{N}}^{(i)} \}$ for each stock $i$, where $N$ is the number of trading days

Ensure: For each stock $i$ in each month $M$: estimated parameters $(\bar{\sigma}_{i}, \bar{\mu}_{i})$

2: for each month $M$ do
3:     for each stock $i$ in SET100 do
4:         Extract daily close prices for stock $i$ in month $M$
5:         Compute log returns $X_{j}^{(i)} = \ln (S_{t_{j}}^{(i)} / S_{t_{j - 1}}^{(i)})$ for $j = 2$ to $N$
6:         Calculate $\bar{\mu}_{i}$ and $\bar{\sigma}_{i}$
7:     end for
8:     Categorize stocks based on $\bar{\sigma}_{i}$ and $\bar{\mu}_{i}$ values
9:     Store month-$M$ results for further analysis
10: end for
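The core of Algorithm 2 reduces to a few NumPy calls. The sketch below is illustrative: the function name `empirical_lr_params`, the dictionary layout, and the ticker and price values are hypothetical, and `ddof=1` is used so that the standard deviation divides by the number of log returns minus one, matching the $N - 2$ denominator above.

```python
import numpy as np


def empirical_lr_params(prices):
    """Return (sigma_bar, mu_bar) of daily log returns for one stock-month."""
    x = np.diff(np.log(np.asarray(prices, dtype=float)))  # N - 1 log returns
    return float(np.std(x, ddof=1)), float(np.mean(x))


# Hypothetical closing prices for two tickers in one month.
month_prices = {
    "AAA": [100.0, 110.0, 121.0],          # constant 10% growth
    "BBB": [20.0, 19.0, 21.0, 20.5],
}
features = {ticker: empirical_lr_params(p) for ticker, p in month_prices.items()}
```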

In the clustering process, the empirical values $(\bar{\sigma}_{i}, \bar{\mu}_{i})$ are compared with the corresponding parameters $(\hat{\sigma}_{i}, \hat{\mu}_{i})$ estimated from the BS model. Prior to clustering, features are z-score standardized within each month $M$ across the $n$ available stocks. Standardization is applied separately to the BS features $(\hat{\sigma}_{i}, \hat{\mu}_{i})$ and to the empirical LR features $(\bar{\sigma}_{i}, \bar{\mu}_{i})$, using the same intersection of tickers in month $M$.
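The within-month standardization can be sketched as below. The function name `zscore_within_month` and the feature values are hypothetical, and the population standard deviation (`ddof=0`) is an assumption, since the source does not state which convention was used.

```python
import numpy as np


def zscore_within_month(features):
    """Standardize each feature column across the stocks of one month."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / features.std(axis=0)


# Hypothetical (sigma, mu) pairs for four stocks in one month.
raw = np.array([[0.10, -0.02],
                [0.25, 0.01],
                [0.18, 0.03],
                [0.40, -0.05]])
standardized = zscore_within_month(raw)
```

Applying the same function separately to the BS and LR feature matrices, built on the same tickers, keeps the two representations on comparable scales before clustering.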

2.4. K-Means Clustering for Parameters

For each month, the estimated parameters of the $n$ stocks can be expressed as a set of vectors $\{ (\hat{a}_{1}, \hat{b}_{1}), (\hat{a}_{2}, \hat{b}_{2}), \dots, (\hat{a}_{n}, \hat{b}_{n}) \}$. The K-means algorithm partitions these $n$ vectors into $K$ disjoint clusters $S = \{ S_{1}, S_{2}, \dots, S_{K} \}$, each represented by a centroid in $C = \{ (a_{1}, b_{1}), (a_{2}, b_{2}), \dots, (a_{K}, b_{K}) \}$. The objective is to minimize the total squared Euclidean distance between parameter vectors and their assigned centroids:

$$\min_{S, C} \sum_{k = 1}^{K} \sum_{i \in S_{k}} \left[ (\hat{a}_{i} - a_{k})^{2} + (\hat{b}_{i} - b_{k})^{2} \right].$$

Cluster centroids are initialized randomly and updated iteratively until convergence (Ball & Hall, 1967; MacQueen, 1962). The final result is a partition of the parameter space into $K$ clusters with corresponding centroids.

To evaluate the quality of clustering, we use the silhouette coefficient, which compares the average intra-cluster distance of each point with its average distance to the nearest other cluster. For a given point $(\hat{a}_{i}, \hat{b}_{i})$, the silhouette value is defined as

$$s_{K} (\hat{a}_{i}, \hat{b}_{i}) = \frac{b_{K} (\hat{a}_{i}, \hat{b}_{i}) - a_{K} (\hat{a}_{i}, \hat{b}_{i})}{\max \{ a_{K} (\hat{a}_{i}, \hat{b}_{i}), b_{K} (\hat{a}_{i}, \hat{b}_{i}) \}},$$

where $a_{K} (\cdot)$ is the mean distance to the other samples in the same cluster and $b_{K} (\cdot)$ is the minimum mean distance to the samples in any other cluster. The silhouette coefficient for the partition with $K$ clusters is then

$$Sil (K) = \frac{1}{n} \sum_{i = 1}^{n} s_{K} (\hat{a}_{i}, \hat{b}_{i}),$$

which lies between $-1$ and 1. The value of $K$ that maximizes $Sil (K)$ is typically chosen as the most appropriate number of clusters (Rousseeuw, 1987).
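The scan over candidate values of $K$ can be sketched with scikit-learn as follows. The function name `silhouette_by_k`, the fixed seed, the `n_init=10` setting, and the synthetic three-group data are assumptions for illustration; the source does not report the k-means settings used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, k_values=range(2, 11), seed=0):
    """Fit k-means for each K and return {K: mean silhouette coefficient}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(features)
        scores[k] = float(silhouette_score(features, labels))
    return scores


# Synthetic standardized features: three well-separated groups of 30 points.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + 0.5 * rng.standard_normal((30, 2)) for c in centers])

scores = silhouette_by_k(points)
best_k = max(scores, key=scores.get)   # K with the highest Sil(K)
```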

2.5. Adjusted Rand Index

To quantify the agreement between clusterings derived from different parameterizations of the same stocks (e.g., k-means on BS parameters $(\hat{\sigma}_{i}, \hat{\mu}_{i})$ versus k-means on empirical LR statistics $(\bar{\sigma}_{i}, \bar{\mu}_{i})$), we employ the Adjusted Rand Index (ARI) (Hubert & Arabie, 1985; Rand, 1971). For each month $M$, let $n$ denote the number of stocks under comparison (we use the intersection of tickers available in both representations). Consider two partitions of $\{ 1, \dots, n \}$, $S = \{ S_{1}, \dots, S_{K} \}$ and $T = \{ T_{1}, \dots, T_{L} \}$, with nonempty, disjoint clusters. Define the contingency counts $N_{k \ell} = | S_{k} \cap T_{\ell} |$, the row sums $a_{k} = \sum_{\ell = 1}^{L} N_{k \ell}$, and the column sums $b_{\ell} = \sum_{k = 1}^{K} N_{k \ell}$. The ARI is given in closed form by

$$ARI (K, L) = \frac{\sum_{k = 1}^{K} \sum_{\ell = 1}^{L} \binom{N_{k \ell}}{2} - \frac{1}{\binom{n}{2}} \left( \sum_{k = 1}^{K} \binom{a_{k}}{2} \right) \left( \sum_{\ell = 1}^{L} \binom{b_{\ell}}{2} \right)}{\frac{1}{2} \left( \sum_{k = 1}^{K} \binom{a_{k}}{2} + \sum_{\ell = 1}^{L} \binom{b_{\ell}}{2} \right) - \frac{1}{\binom{n}{2}} \left( \sum_{k = 1}^{K} \binom{a_{k}}{2} \right) \left( \sum_{\ell = 1}^{L} \binom{b_{\ell}}{2} \right)} .$$

The ARI takes values in $[-1, 1]$: $ARI = 1$ indicates that the partitions are identical (perfect agreement), values near 0 indicate chance-level agreement, and negative values indicate less agreement than expected by chance. The ARI is permutation-invariant and is well defined for arbitrary cluster cardinalities $(K, L)$. In our empirical analyses, we set $K = L$ and report $ARI (M, K)$ for $K \in \{ 2, \dots, 10 \}$ in month $M$.
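The closed-form expression can be evaluated directly from the contingency table. The sketch below is a plain NumPy illustration; the function name `adjusted_rand_index` and the toy label vectors are hypothetical, and returning 1 when the denominator vanishes (for example, when both partitions consist of a single cluster) is a convention assumed here.

```python
from math import comb

import numpy as np


def adjusted_rand_index(labels_s, labels_t):
    """ARI of two labelings of the same items, from the contingency table."""
    _, s_idx = np.unique(np.asarray(labels_s), return_inverse=True)
    _, t_idx = np.unique(np.asarray(labels_t), return_inverse=True)
    table = np.zeros((s_idx.max() + 1, t_idx.max() + 1), dtype=int)
    np.add.at(table, (s_idx, t_idx), 1)              # counts N_{kl}
    sum_cells = sum(comb(int(v), 2) for v in table.ravel())
    sum_rows = sum(comb(int(v), 2) for v in table.sum(axis=1))   # a_k
    sum_cols = sum(comb(int(v), 2) for v in table.sum(axis=0))   # b_l
    expected = sum_rows * sum_cols / comb(len(s_idx), 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:                        # degenerate case
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Same grouping with swapped label names: perfect agreement.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Crossed grouping of four items: agreement below chance level.
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call illustrates permutation invariance (the value is 1 although the label names differ), and the second yields $-0.5$, a case of less agreement than expected by chance.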

2.6. Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is a nonparametric method for paired comparisons that does not require normality and is well suited to small samples. It is therefore appropriate for assessing the differences in clustering quality between the two feature sets (Woolson, 2007).

In our analysis, the null hypothesis is that the median difference in silhouette scores between BS and empirical LR features is zero, meaning that both approaches perform equally well. We first applied a two-sided test at the 5% level to check for any difference. When this was significant, we followed with a one-sided test to evaluate whether clustering based on BS features provided higher quality than clustering based on LRs. To assess the robustness of the directional hypothesis, the Bonferroni correction ($\alpha_{adj} = 0.05 / 12 \approx 0.0042$) was applied to the one-sided tests, controlling the family-wise error rate (FWER) across the 12 monthly comparisons.
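The two-stage testing rule can be sketched with `scipy.stats.wilcoxon` as follows. The function name `compare_silhouettes` and the silhouette values are hypothetical, and SciPy's default method selection is assumed (an exact distribution for small samples without ties or zero differences).

```python
import numpy as np
from scipy.stats import wilcoxon


def compare_silhouettes(sil_bs, sil_lr, alpha=0.05, n_months=12):
    """Two-sided test first; one-sided test with Bonferroni check if significant."""
    d = np.asarray(sil_bs, dtype=float) - np.asarray(sil_lr, dtype=float)
    p_two = float(wilcoxon(d, alternative="two-sided").pvalue)
    result = {"p_two_sided": p_two, "p_greater": None,
              "bonferroni_significant": False}
    if p_two < alpha:
        p_greater = float(wilcoxon(d, alternative="greater").pvalue)
        result["p_greater"] = p_greater
        result["bonferroni_significant"] = bool(p_greater < alpha / n_months)
    return result


# Hypothetical silhouette scores for K = 2, ..., 10 in one month,
# with the BS features ahead at every K by a different margin.
sil_lr = np.array([0.50, 0.48, 0.45, 0.44, 0.42, 0.41, 0.40, 0.39, 0.38])
sil_bs = sil_lr + 0.01 * np.arange(1, 10)
result = compare_silhouettes(sil_bs, sil_lr)
```

With nine paired differences, the smallest attainable exact p-values are $2 / 2^{9} \approx 0.0039$ (two-sided) and $1 / 2^{9} \approx 0.0020$ (one-sided), so only a month in which one feature set wins at every $K$ can fall well below the Bonferroni threshold of about 0.0042.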

2.7. Overall Monthly Clustering Pipeline

The entire procedure is organized as a monthly pipeline. First, daily closing prices are screened for log-normality using the AD test, and only stocks that pass are retained. For these stocks, two feature sets are obtained: BS parameters estimated by iterative simulation and maximum likelihood, and empirical parameters computed directly from LRs. Both feature sets are standardized within each month and clustered separately by k-means over candidate values of K. Clustering quality is evaluated by the silhouette coefficient, and agreement between the two representations is quantified by the ARI. Cluster quality is compared within each month using paired Wilcoxon tests on silhouette values. If the two-sided test is significant, a one-sided test assesses whether BS parameters outperform LRs. Algorithm 3 summarizes the full workflow in pseudocode form.

Algorithm 3 Monthly Clustering Pipeline with ARI and Wilcoxon Tests

Require: Set of months $\mathcal{M}$; for each month $M \in \mathcal{M}$: list of SET100 stocks; significance level $\alpha = 0.05$ for all statistical tests; candidate cluster sizes $K \in \{ 2, \dots, 10 \}$

Ensure: For each $M$: standardized features on the screened set $I_{M}$; k-means labels; $ARI (M, K)$; and Wilcoxon p-values

1: for each month $M \in \mathcal{M}$ do
2:     Screening: compute daily LRs for each stock; test log-normality at level $\alpha$; let $I_{M}$ be the set that passes
3:     BS parameters: for each $i \in I_{M}$, run Algorithm 1 to obtain $(\hat{\sigma}_{i}, \hat{\mu}_{i})$
4:     Empirical LR parameters: for each $i \in I_{M}$, run Algorithm 2 to obtain $(\bar{\sigma}_{i}, \bar{\mu}_{i})$
5:     Standardize: z-score both feature sets within month $M$ over the same index $I_{M}$
6:     Initialize empty list $D$
7:     for each $K = 2, \dots, 10$ do
8:         Fit K-means on standardized BS features → labels $z^{BS} (M, K)$
9:         Fit K-means on standardized LR features → labels $z^{LR} (M, K)$
10:        Compute $ARI (M, K)$ from $z^{BS} (M, K)$ and $z^{LR} (M, K)$
11:        Compute $Sil^{BS} (M, K)$ from the silhouette of $z^{BS} (M, K)$
12:        Compute $Sil^{LR} (M, K)$ from the silhouette of $z^{LR} (M, K)$
13:        $D_{K} \leftarrow Sil^{BS} (M, K) - Sil^{LR} (M, K)$; append $D_{K}$ to $D$
14:    end for
15:    Wilcoxon signed-rank test (within month $M$):
16:    Compute two-sided Wilcoxon on $D$ $\to p_{two-sided} (M)$
17:    if $p_{two-sided} (M) < \alpha$ then
18:        Compute one-sided Wilcoxon on $D$ $\to p_{greater} (M)$
19:    end if
20:    Store all outputs for month $M$
21: end for
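Steps 5 to 19 of Algorithm 3 for a single month can be sketched as below, taking the two feature matrices of the screened stocks as given. This is a hedged illustration rather than the authors' pipeline: the function names, the k-means settings, the handling of the all-zero-difference case, and the synthetic inputs are assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score


def zscore(features):
    """Standardize columns across the stocks of one month (ddof=0 assumed)."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / features.std(axis=0)


def monthly_pipeline(bs_features, lr_features, k_values=range(2, 11),
                     alpha=0.05, seed=0):
    """Cluster both feature sets, then compare them by ARI and Wilcoxon tests."""
    bs, lr = zscore(bs_features), zscore(lr_features)
    ari, diffs = {}, []
    for k in k_values:
        z_bs = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(bs)
        z_lr = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(lr)
        ari[k] = float(adjusted_rand_score(z_bs, z_lr))
        diffs.append(silhouette_score(bs, z_bs) - silhouette_score(lr, z_lr))
    diffs = np.asarray(diffs)
    p_greater = None
    if not np.any(diffs):
        p_two = 1.0   # identical scores at every K: no evidence of a difference
    else:
        p_two = float(wilcoxon(diffs, alternative="two-sided").pvalue)
        if p_two < alpha:
            p_greater = float(wilcoxon(diffs, alternative="greater").pvalue)
    return {"ari": ari, "diffs": diffs, "p_two_sided": p_two,
            "p_greater": p_greater}


# Synthetic (sigma, mu) features for 40 screened stocks; LR is a noisy copy of BS.
rng = np.random.default_rng(7)
bs_raw = rng.normal(size=(40, 2))
lr_raw = bs_raw + 0.3 * rng.normal(size=(40, 2))
out = monthly_pipeline(bs_raw, lr_raw)
```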

3. Results

This section presents the empirical findings of the study. We begin with data screening and summary statistics of the estimated parameters derived from the SET100 constituents, followed by an evaluation of clustering performances using both BS parameters and empirical LRs. The degree of agreement between the two approaches is then assessed through the ARI, and statistical comparisons of cluster quality are carried out using the Wilcoxon signed-rank test. Finally, we provide out-of-sample evidence based on cumulative returns to evaluate the practical implications of the clustering outcomes.

3.1. Data Screening and Parameter Summary Statistics

Daily closing prices of SET100 stocks in 2020 were tested for lognormality using the AD test at the 5% level. Only stocks that passed were retained for parameter estimation and clustering, resulting in varying sample sizes across months. In addition, a few outliers were excluded: CRC and BAM in February, STA and PRM in March, STA in May, THANI in August, and DELTA in December. The final counts of retained stocks (the number of stocks that passed the lognormality test, denoted by N) are reported in Table 1.

Table 1 reports monthly summary statistics of the estimated parameters. For each month, both the model-based parameters $(\hat{\mu}, \hat{\sigma})$ from the Black–Scholes framework and the empirical statistics $(\bar{\mu}, \bar{\sigma})$ from log returns are summarized in terms of their mean, standard deviation, minimum, maximum, skewness, and kurtosis across all retained stocks.

Overall, the estimated drifts $\hat{\mu}$ were mostly negative during the first quarter, consistent with the COVID-19 shock, and became positive in April and November, reflecting brief market recoveries. Volatility parameters $\hat{\sigma}$ peaked in March, July, and December, indicating turbulent periods with heavy-tailed behavior, as confirmed by elevated skewness and kurtosis values in those months. The empirical moments $(\bar{\mu}, \bar{\sigma})$ displayed similar but less pronounced patterns, suggesting that the model-based estimation captured stronger dispersion dynamics than the raw return measures. These descriptive patterns reveal alternating episodes of contraction and rebound throughout 2020, providing a contextual backdrop for the clustering analysis in the following section. These features collectively capture market heterogeneity over time and motivate the use of clustering techniques to identify distinct parameter-based stock groupings.

3.2. Clustering Performance

Figure 1 and Figure 2 display the silhouette scores of k-means clustering across different numbers of clusters ($K = 2, \dots, 10$), separated into the first and second halves of the year, respectively. The results show that silhouette values varied considerably across months and cluster sizes. In some months, clustering based on the BS parameters produced higher scores, indicating more cohesive and well-separated groups. In other months, the results were similar, highlighting the inconsistency of the relative performances of the two approaches throughout the year.

To complement Figure 1 and Figure 2 and to enhance the visual interpretation of month-by-month variations, an additional difference heatmap (Figure 3) is presented. This figure directly displays the differences in silhouette scores (BS minus LR) across all months and cluster numbers ($K = 2, \dots, 10$), allowing a clearer high-level comparison between the two feature sets. The heatmap highlights the specific months and cluster sizes where BS-based clustering achieves stronger cohesion, while confirming that neither approach consistently dominates throughout the year.

Figure 4 provides an overall summary by presenting the monthly average silhouette scores, which allow a clearer comparison between the two approaches. On average, the BS parameterization tended to yield slightly higher silhouette scores, although the advantage was not consistent across months.

It should be noted that the number of stocks included in each month was not identical, as only those passing the lognormality screening and not identified as outliers were retained (see Table 1). Since higher silhouette values reflect more compact and well-separated clusters, the results suggest that BS parameterization captured additional structure in stock behavior beyond that conveyed by simple LR statistics, although the advantage was not uniform across all months.

3.3. Clustering Agreement

Figure 5 displays ARI values across cluster sizes ($K = 2, \dots, 10$) for each month. The results indicate that the level of agreement was most often moderate to high. In several months, ARI values reached relatively high levels, showing that BS parameters and empirical LRs produced broadly consistent partitions. In other months, the ARI values were moderate, reflecting partial but meaningful consistency. Truly low values were observed only occasionally, indicating that the two approaches rarely produced completely divergent cluster structures.

Figure 6 presents the monthly average ARI values, providing a clearer overall comparison. Most months show average agreement in the moderate-to-high range, with a few months falling lower. Taken together, Figure 5 and Figure 6 indicate that the clustering results from BS parameters and empirical LRs were broadly consistent, although the degree of correspondence varied across months and cluster sizes. Although the moderate-to-high ARI values indicate broad agreement in overall cluster boundaries, the additional structure captured by the BS-based features, which yields higher silhouette scores, appears primarily as tighter within-cluster cohesion rather than fundamentally different stock groupings. This suggests that the diffusion parameters effectively filtered noise and enhanced the definition of existing clusters.

3.4. Statistical Comparison of Cluster Quality

Wilcoxon signed-rank tests were applied to the silhouette values across cluster sizes ($K = 2, \dots, 10$) for each month. The results are shown in Table 2. The two-sided tests indicated no significant differences in most months ($p > 0.05$), except for August ($p = 0.0078$) and November ($p = 0.0039$). For these two months, one-sided tests confirmed that clustering based on BS parameters achieved significantly higher silhouette scores (August $p = 0.0039$, November $p = 0.0020$). Furthermore, after applying the Bonferroni correction for the twelve monthly one-sided tests ($α_{adj} = 0.05/12 \approx 0.0042$), the outperformance of BS features remained significant in both August and November. Overall, clustering quality was generally comparable between the two approaches, with the diffusion-based representation showing a clear advantage in these specific months.
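As an illustration of the test, note that nine paired values per month ($K = 2, \dots, 10$) without ties give $2^9 = 512$ equally likely sign patterns under the null hypothesis, so the smallest attainable one-sided p-value is $1/512 \approx 0.0020$ and the next is $2/512 \approx 0.0039$. The one-sided values reported above coincide with these two levels. The text does not state whether an exact or an approximate test was used, so the following sketch of the exact test is an assumption, and the silhouette differences in it are invented for illustration.

```python
def wilcoxon_exact_greater(diffs):
    """Exact one-sided Wilcoxon signed-rank test for H1: differences tend to be > 0.

    Returns (W+, p) with p = P(W+ >= observed) under the null hypothesis.
    Assumes no zero differences and no tied absolute values.
    """
    n = len(diffs)
    magnitudes = [abs(d) for d in diffs]
    if 0 in magnitudes or len(set(magnitudes)) != n:
        raise ValueError("zeros or ties are not handled by this sketch")

    # Rank the absolute differences (1 = smallest) and sum the positive ranks.
    order = sorted(range(n), key=lambda i: magnitudes[i])
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)

    # counts[s] = number of sign patterns whose positive ranks sum to s.
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]

    return w_plus, sum(counts[w_plus:]) / 2 ** n


# Hypothetical BS-minus-LR silhouette differences for K = 2, ..., 10.
all_positive = [0.05, 0.03, 0.08, 0.02, 0.06, 0.01, 0.04, 0.07, 0.09]
one_negative = [0.05, 0.03, 0.08, 0.02, 0.06, -0.01, 0.04, 0.07, 0.09]

alpha_adj = 0.05 / 12  # Bonferroni threshold for twelve monthly tests
for diffs in (all_positive, one_negative):
    w_plus, p = wilcoxon_exact_greater(diffs)
    print(w_plus, p, p < alpha_adj)
```

With only nine pairs, significance at the Bonferroni-adjusted level therefore requires BS to score higher at every cluster size, or at all but the one with the smallest absolute difference.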

3.5. Out-of-Sample Cumulative Return Analysis

To evaluate whether the clustering outcomes contained predictive information, we examined the subsequent cumulative returns of selected clusters. We began with August 2020, where BS-based clustering achieved significantly higher silhouette scores than LR-based clustering. In this month, the highest silhouette value occurred at $K = 2$, but such a coarse partition provided limited insight, as all stocks were grouped into only two broad clusters. The second-best value was obtained at $K = 6$, which maintained a competitive silhouette score while offering a more granular structure that facilitated interpretation. In practice, however, both BS and LR produced identical cluster assignments at $K = 6$, so the subsequent cumulative returns in September were indistinguishable. This underscores that higher silhouette values do not always translate into practical differences when the resulting partitions coincide.

In contrast, November 2020 revealed a stronger divergence. In this case, the highest silhouette values were found at $K = 2$ and $K = 3$, but such coarse partitions provided limited insight, as they failed to capture meaningful sectoral structure for investment practice. The third-highest value was at $K = 8$, which was selected as a representative case study that offered a reasonable balance between statistical cohesion and economic granularity. More importantly, at $K = 8$, BS and LR produced visibly different memberships, providing a natural candidate for out-of-sample testing. Specifically, we focused on the two clusters in November 2020 ($K = 8$) where BS and LR memberships diverged. Each cluster was represented by its five nearest-to-centroid stocks. Figure 7 compares clustering results based on BS (left) and LR (right) at $K = 8$ for November 2020. Within the zoomed region, only the ten centroid-nearest representatives (five per cluster) are highlighted (red = Group A, blue = Group B), while all other stocks are plotted faintly. Under BS parameterization, CPF appears in Group A and GFPT in Group B, whereas under LR parameterization, PTG replaces CPF in Group A and BPP replaces GFPT in Group B.
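The selection of centroid-nearest representatives can be sketched as follows. This is a minimal illustration, assuming Euclidean distance in the two-dimensional feature space and taking the centroid as the mean of the cluster members; the tickers `S1` to `S7`, their feature values, and the function name are hypothetical.

```python
import math


def nearest_to_centroid(features, labels, cluster, n_reps=5):
    """Return the n_reps members of `cluster` closest to its centroid.

    features: dict mapping ticker -> tuple of feature values
    labels:   dict mapping ticker -> cluster label
    """
    members = [ticker for ticker in features if labels[ticker] == cluster]
    if not members:
        raise ValueError(f"cluster {cluster!r} has no members")

    # Centroid taken as the coordinate-wise mean of the cluster members.
    dim = len(features[members[0]])
    centroid = tuple(
        sum(features[m][d] for m in members) / len(members) for d in range(dim)
    )
    ranked = sorted(members, key=lambda m: math.dist(features[m], centroid))
    return ranked[:n_reps]


# Hypothetical (drift, diffusion) features and cluster labels.
features = {
    "S1": (0.00, 0.10), "S2": (0.02, 0.10), "S3": (0.00, 0.12),
    "S4": (0.02, 0.12), "S5": (0.01, 0.11), "S6": (0.05, 0.20),
    "S7": (0.30, 0.40),
}
labels = {"S1": 0, "S2": 0, "S3": 0, "S4": 0, "S5": 0, "S6": 0, "S7": 1}

print(nearest_to_centroid(features, labels, cluster=0, n_reps=5))
```

Restricting each cluster to a fixed number of representatives keeps the comparison between BS and LR portfolios on equal footing even when the two methods produce clusters of different sizes.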

We then evaluated whether these membership differences carried predictive value by computing cumulative returns in the subsequent month (December 2020). For each method, we tracked the five representative stocks (those closest to the cluster centroid) identified in Figure 7 and calculated their portfolio-level growth. Cumulative returns were computed from simple daily returns according to

$$R = \prod_{t = 1}^{N} (1 + r_{t}) - 1,$$

where $r_{t}$ denotes the daily return and $N$ the number of trading days in December. To complement this analysis, we also evaluated the risk-adjusted performance of the representative portfolios by computing their Sharpe ratios. The Sharpe ratio was defined as the mean daily portfolio return divided by its standard deviation, assuming a zero risk-free rate, thereby quantifying the average excess return earned per unit of risk. Table 3 thus reports both the cumulative returns and the corresponding Sharpe ratios for each cluster and method.

Portfolios were constructed as equal-weighted baskets of the five centroid-nearest representatives, rebalanced at the beginning of the subsequent month and held for the full duration of the month. Transaction costs, taxes, and dividends were ignored to focus on the comparative effect of clustering assignments. All selected stocks are highly liquid constituents of the SET100 index; therefore, any transaction cost differences are negligible and would not materially alter the relative performance between BS- and LR-based portfolios.
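The portfolio calculation described above can be sketched as follows. This is an illustration under stated assumptions: the basket is bought in equal weights on the first day and then held without further rebalancing, the Sharpe ratio uses the sample standard deviation with a zero risk-free rate, and all return values are invented. The function names are ours, not the authors'.

```python
import statistics


def cumulative_return(daily_returns):
    """R = prod(1 + r_t) - 1 over the holding period."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


def equal_weight_portfolio_returns(stock_returns):
    """Daily returns of a basket bought in equal weights on day 0 and then held.

    stock_returns: list of per-stock lists of simple daily returns (equal length).
    """
    n_days = len(stock_returns[0])
    values = [1.0] * len(stock_returns)  # one unit of capital per stock
    prev_total = sum(values)
    portfolio = []
    for t in range(n_days):
        values = [v * (1.0 + r[t]) for v, r in zip(values, stock_returns)]
        total = sum(values)
        portfolio.append(total / prev_total - 1.0)
        prev_total = total
    return portfolio


def sharpe_ratio(daily_returns):
    """Mean daily return over its sample standard deviation (zero risk-free rate)."""
    return statistics.mean(daily_returns) / statistics.stdev(daily_returns)


# Hypothetical daily returns for two stocks over three trading days.
basket = [[0.10, -0.10, 0.00], [0.00, 0.00, 0.00]]
port = equal_weight_portfolio_returns(basket)
print(cumulative_return(port), sharpe_ratio(port))
```

Under this buy-and-hold assumption, the portfolio's cumulative return equals the simple average of the constituents' cumulative returns, which is consistent with reading the reported figures as average losses across the five representatives.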

The contrast is clear. In the first case, BS included CPF (a defensive food stock) instead of PTG (an energy stock with higher volatility), cutting the average loss from $-4.38\%$ to $-3.44\%$. In the second, BS selected GFPT (a food sector stock) rather than BPP (a power producer), reducing the loss from $-2.33\%$ to $-1.01\%$. Beyond raw returns, the Sharpe ratios also improved from $-0.16$ to $-0.12$ in Cluster A and from $-0.04$ to $-0.01$ in Cluster B, confirming that BS-based clusters achieved superior risk-adjusted performance. These results indicate that the diffusion-based parameterization yields not only structurally coherent clusters but also portfolios that remain more resilient to market turbulence.

Thus, while August with $K = 6$ showed no predictive differences because BS and LR partitions were identical, November with $K = 8$ showed that diffusion-based features can form clusters with more resilient out-of-sample performance, strengthening the practical value of the BS parameterization.

4. Discussion

4.1. Summary of Findings

The analysis yielded five main results. First, stock returns in 2020 varied substantially, with volatility spikes and heavy tails in February, March, July, and December. Second, clustering with BS parameters generally produced higher silhouette scores than LRs, though not in every month. Third, ARI values showed moderate to high agreement, indicating broadly similar partitions from both feature sets. Fourth, Wilcoxon tests identified significant BS gains only in August and November. Finally, the out-of-sample test confirmed that the November BS partition achieved clearer sector separation and smaller portfolio losses, while predictive differences were limited in other months.

4.2. Interpretation of Results

To further interpret the clustering results, we conducted a supplementary analysis of cross-sector divergence. For each month, we computed market volatility as the cross-sectional mean of all stock volatilities, and compared this with the average volatility within each sector. The absolute difference between these two values was taken as a measure of how much each sector diverged from the overall market. This provided a simple way to capture heterogeneity in sectoral behavior.
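The divergence measure described above amounts to a few lines of arithmetic. The sketch below assumes one volatility value per stock for the month and a sector label per stock; the tickers, sector names, and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def sector_divergence(volatility, sector_of):
    """Absolute gap between each sector's mean volatility and the market mean.

    volatility: dict mapping ticker -> volatility estimate for the month
    sector_of:  dict mapping ticker -> sector name
    """
    market_vol = mean(volatility.values())  # cross-sectional mean over all stocks
    by_sector = defaultdict(list)
    for ticker, vol in volatility.items():
        by_sector[sector_of[ticker]].append(vol)
    return {sector: abs(mean(vols) - market_vol) for sector, vols in by_sector.items()}


# Hypothetical monthly volatilities and sector assignments.
volatility = {"A": 0.10, "B": 0.12, "C": 0.20, "D": 0.22, "E": 0.16}
sector_of = {"A": "Food", "B": "Food", "C": "Energy", "D": "Energy", "E": "Banking"}

for sector, gap in sector_divergence(volatility, sector_of).items():
    print(sector, round(gap, 4))
```

Running the same function once with the BS diffusion estimates and once with the empirical LR standard deviations gives the two sets of sector-to-market gaps that are compared in the next paragraph.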

The analysis revealed that August and November stood out, as the BS parameterization consistently produced larger sector-to-market divergences across almost all sectors than the empirical LR parameterization. In other words, BS features highlighted sectoral differences more strongly—an outcome that was consistent with the significantly higher silhouette scores observed in these months. Both months also coincided with major COVID-19 developments—the domestic second wave in August and the announcement of effective vaccines in November—which triggered heterogeneous sector responses. These conditions likely contributed to the superior clustering performance of the BS features during these months. These findings suggest that the advantages of the BS parameterization emerge particularly under market conditions characterized by heightened sectoral divergence.

Specifically, the BS diffusion volatility was estimated via maximum likelihood under the assumption of a continuous diffusion process. This estimation acted as a statistical filter that smoothed transitory noise and outliers in daily returns, providing a more stable and structural representation of the underlying volatility dynamics. Such stability allowed the clustering algorithm to capture persistent cross-sectional differences among stocks rather than reacting to short-lived fluctuations, which explained the superior cohesion and sectoral interpretability observed in BS-based clusters.
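For orientation, the textbook closed-form maximum likelihood estimators for a geometric Brownian motion observed at equally spaced times are sketched below. This is not the study's Algorithm 1, which is simulation-based; it is shown only to illustrate how drift and diffusion relate to log-returns under the GBM assumption, where each log-return is normally distributed with mean $(μ - σ^2/2)Δt$ and variance $σ^2 Δt$. The function name, the price series, and the choice of $Δt$ are assumptions.

```python
import math


def gbm_mle(prices, dt=1.0):
    """Closed-form MLE of (mu, sigma) for GBM from equally spaced prices.

    dt is the spacing between observations in the time unit of interest,
    e.g. dt = 1/20 for daily prices if one month is taken as 20 trading days.
    """
    if len(prices) < 2:
        raise ValueError("at least two prices are required")
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    n = len(log_returns)
    mean_lr = sum(log_returns) / n
    # The MLE of the variance divides by n rather than n - 1.
    var_lr = sum((x - mean_lr) ** 2 for x in log_returns) / n
    sigma = math.sqrt(var_lr / dt)
    mu = mean_lr / dt + 0.5 * sigma ** 2
    return mu, sigma


# Hypothetical daily closing prices.
prices = [100.0, 101.5, 99.8, 102.3, 103.0, 101.9]
print(gbm_mle(prices))             # per-observation (daily) parameters
print(gbm_mle(prices, dt=1 / 20))  # rescaled to a 20-trading-day month
```

In this closed form the diffusion estimate is a rescaled standard deviation of log-returns, so any additional smoothing of the kind described above would have to come from the estimation procedure actually used rather than from the formula itself.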

Although the empirical LR statistics also represent a form of risk–return trade-off, they are inherently backward-looking, capturing realized outcomes within each month. In contrast, the BS-based parameters are estimated under a continuous diffusion framework that provides a theoretically consistent and forward-looking representation of price dynamics. Rather than summarizing past data, these parameters reflect the market’s expected risk–return structure and filter out transitory noise, allowing the clustering to capture persistent differences in underlying diffusion regimes.

4.3. Statistical and Econometric Perspectives on Robustness

From a statistical standpoint, conducting twelve monthly directional tests increased the likelihood of Type I errors. To control the FWER, a Bonferroni correction was applied to the one-sided Wilcoxon signed-rank results, setting the adjusted significance level at $α_{adj} = 0.05/12 \approx 0.0042$. After this adjustment, the significant outperformance of BS features was confirmed in both August and November. This result refines the inference while confirming the robustness of the study’s original finding of two significant months.

From an econometric perspective, however, these monthly tests are not fully independent because the same stock universe contributes to multiple cross-sections. Hence, the Bonferroni adjustment represents a conservative upper bound on the true FWER. Even under this stringent criterion, the persistence of the confirmed significance in both months strongly supports the economic interpretation that BS-based clustering remains more resilient and structurally informative in high-volatility market regimes.

Although the original analysis utilized the unadjusted 5% level for primary inference, this correction serves as a necessary robustness confirmation that validates the study’s empirical conclusions against a rigorous statistical standard.

4.4. Predictive Value of Clustering

The out-of-sample evidence provided an important complement to the statistical evaluation of clustering quality. In August ($K = 6$), BS and LR produced identical partitions, indicating that a higher silhouette score alone does not guarantee predictive relevance when group memberships coincide. By contrast, in November ($K = 8$), BS and LR generated distinct partitions, and the BS-based clusters aligned with more resilient performance in the subsequent month. This contrast suggests that diffusion-based features are particularly valuable under conditions of sectoral divergence and market stress, where volatility structures play a critical role in separating stock behaviors. More broadly, the findings underscore that internal validity metrics such as silhouette scores, while useful, are not sufficient on their own; economic interpretation and out-of-sample confirmation are essential to establish practical value. These insights set the stage for a discussion of how clustering outcomes can inform market analysis and portfolio strategies, which follows in the next section.

4.5. Practical Implications for Market Analysis

Building on the evidence that BS-based clustering can reveal predictive value during periods of sectoral divergence, several practical implications emerge for market practitioners. First, the use of diffusion-based parameters provides an alternative feature space for clustering stocks, which can uncover sectoral structures that may not be visible when relying solely on LRs. This is particularly valuable during periods of market stress or structural change, when volatility heterogeneity across sectors becomes more pronounced.

Second, identifying months in which BS parameterization yields clearer clusters can help analysts recognize when volatility-driven factors dominate market behavior. Such insights may support sector rotation strategies, risk management practices, and the design of diversified portfolios that adapt to changing market regimes. To make this insight operational, we propose a Market-Regime Decision Framework to guide when the BS-based or LR-based clustering should be applied. The framework uses the outcome of the Wilcoxon signed-rank test as a market-state indicator:

Market-Stress Regime (BS Superior): When the Wilcoxon test comparing BS- and LR-based silhouette scores is statistically significant (e.g., $p < α_{adj}$), it signals that the market’s volatility structure is undergoing sectoral divergence. In this regime, the BS-based clustering should be adopted, as it better captures the underlying structural heterogeneity.
Normal-Market Regime (Comparable): When the test is not significant ($p \geq α_{adj}$), the market is deemed stable or less divergent. In this case, the simpler LR-based clustering may be preferred to maximize operational stability and computational efficiency, as the LR results are less prone to implementation error and avoid unnecessary reliance on simulation-based parameter estimation when its incremental value is minimal.

This rule-based procedure provides a data-driven guideline for practitioners to determine when diffusion-based clustering should be prioritized, offering an executable bridge between empirical results and real-world decision-making.
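The two regimes above reduce to a single threshold comparison. The following sketch assumes the one-sided Wilcoxon p-value for the month is already available and uses the Bonferroni-adjusted level for twelve monthly tests; the function name and the return labels are illustrative.

```python
def choose_feature_set(p_one_sided, n_tests=12, alpha=0.05):
    """Pick the clustering feature set from the monthly one-sided Wilcoxon p-value.

    Returns "BS" in the market-stress regime (p below the Bonferroni-adjusted
    level) and "LR" in the normal-market regime.
    """
    alpha_adj = alpha / n_tests
    return "BS" if p_one_sided < alpha_adj else "LR"


# One-sided p-values reported above for August and November 2020,
# followed by a hypothetical non-significant month.
for month, p in [("August", 0.0039), ("November", 0.0020), ("Hypothetical", 0.20)]:
    print(month, choose_feature_set(p))
```

Keeping the number of tests and the nominal level as parameters makes it straightforward to apply the same rule to a different evaluation window or a less conservative correction.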

From a practical standpoint, Algorithm 1 requires 50,000 Monte Carlo simulations per stock per month, which entails a higher computational burden than directly computing historical means and standard deviations. Nevertheless, this additional cost was moderate and justified by the methodological benefits. The simulation-based estimation produced stochastically robust parameters that more effectively captured the underlying return distribution, yielding clearer and more stable clusters during volatile market periods. Ultimately, the incremental computational effort represents a necessary and reasonable trade-off between efficiency and methodological robustness.

Finally, the methodological framework developed here, which combines model-based parameter estimation, clustering evaluation, and statistical comparison, can be extended beyond equity markets. For example, diffusion-based features could be used to cluster derivative instruments or credit products, offering a richer perspective for both academic research and applied financial analysis.

4.6. Comparison with Previous Studies

Previous research has commonly compared clustering methods based on empirical returns or alternative statistical transformations, but few studies have explicitly contrasted diffusion-based parameterizations with raw LRs. Most applications of the BS framework have focused on derivatives pricing and risk management rather than unsupervised learning of equity structures. In this sense, our work contributes a novel perspective by showing that parameters estimated from a diffusion model can also serve as effective clustering features. While earlier studies generally reported limited differences between returns-based representations, our findings highlight that BS features may offer advantages under specific market conditions characterized by sectoral divergence. This approach complements rather than contradicts the existing literature, suggesting that diffusion-based features provide additional explanatory power in settings where volatility heterogeneity is central. These insights not only extend the clustering literature but also reinforce the practical implications discussed above, where volatility-sensitive features can play a decisive role in portfolio analysis.

4.7. Broader Applications: Derivatives Pricing

Beyond clustering analysis, the estimation of BS parameters has direct relevance for derivatives pricing. The drift and diffusion terms obtained from our procedure are precisely those required in the valuation of European-style options under the BS framework. This alignment suggests that the parameter estimation approach developed here can serve a dual purpose: not only as an input for unsupervised learning, but also as a foundation for pricing and hedging derivative securities.

Moreover, the ability of the BS features to capture sectoral heterogeneity suggests that option-implied valuations based on these parameters may more accurately reflect differences in risk between sectors. This can be valuable for practitioners who need to price derivatives or manage positions under conditions of market stress, where volatility dynamics play a central role. Crucially, the clustering outcomes themselves carry practical implications for derivative pricing and portfolio risk control. By grouping stocks into diffusion-based clusters that share similar drift–volatility dynamics, market participants can identify coherent volatility regimes that may guide the calibration of implied volatility surfaces or correlation structures in multi-asset options. Tracking transitions in cluster composition over time can also serve as an early signal of regime shifts in volatility, enabling more responsive adjustments to hedging ratios or option portfolio exposures. Taken together, the clustering results provide a bridge between statistical segmentation and actionable decisions in derivative pricing and risk management.

In this way, the methodological framework proposed in this study contributes both to market structure analysis and to traditional applications of stochastic diffusion models in finance. While the immediate application is option valuation, the same diffusion parameters can also underpin other derivatives such as volatility swaps or moment swaps, where accurate volatility modeling is critical.

4.8. Study Limitations and Future Work

This study has several limitations that should be acknowledged. First, the analysis was restricted to the SET100 constituents within a single year (2020), which may reflect COVID-specific market behavior and limit generalizability. Future research could extend the framework to longer time horizons or to different markets in order to assess robustness across economic regimes. Second, the BS parameters were estimated using the classical BS diffusion as a simplified baseline. While this model ensured analytical tractability, it did not fully capture stylized facts of financial time series such as volatility clustering, heavy tails, or price jumps. Future extensions could therefore incorporate richer stochastic processes such as stochastic volatility, jump-diffusion, or mean-reverting models to provide more realistic feature representations for clustering. In particular, parameters such as the volatility-of-volatility in stochastic-volatility models or the jump intensity in jump-diffusion settings could be evaluated to determine whether the observed improvements arose from the model specification itself or from the broader use of structural diffusion parameters as clustering features. Third, months with severely reduced samples (e.g., March 2020) should be interpreted with caution due to limited statistical power, as smaller cross-sectional sets inherently reduce the reliability of cluster-level statistical comparisons. Future research may relax the lognormality screening threshold (e.g., $α = 0.10$) to examine the robustness of the proposed clustering framework under less restrictive distributional assumptions. Fourth, the clustering analysis was based solely on k-means and a limited set of evaluation metrics. While this choice provided interpretability and comparability, alternative clustering methods, such as hierarchical or density-based approaches, and complementary validity indices could yield additional insights. Finally, although we discussed potential applications for market analysis and derivatives pricing, the out-of-sample evaluation was limited to a single one-month case study (November 2020 clusters held through December 2020) and did not extend to risk forecasting or option markets. Future work could therefore investigate how diffusion-based features influence portfolio performance, risk forecasting, or option market dynamics over longer horizons.

5. Conclusions

This study investigated whether parameters estimated from the Black–Scholes diffusion model can serve as informative features for stock clustering, using SET100 constituents in 2020 as a case study. Compared with empirical log-returns, Black–Scholes parameters often produced higher silhouette scores, although the advantage was mainly concentrated in periods of market stress such as August and November 2020. Clustering agreement, as measured by the Adjusted Rand Index, was moderate to high, suggesting that the two approaches captured broadly similar structures in most months. Statistical testing confirmed significant gains for Black–Scholes features only in selected months, yet the out-of-sample analysis demonstrated that the November 2020 Black–Scholes partition achieved clearer sector separation and smaller portfolio losses than its log-returns counterpart.

The novelty of this work lies in extending stock clustering beyond traditional returns-based features to diffusion-based parameters that capture the underlying price-generating process. This model-based perspective revealed sectoral divergences that are often obscured in empirical returns, particularly during turbulent market regimes. The inclusion of risk-adjusted portfolio evaluation through Sharpe ratios and a simple Market-Regime Decision Framework further strengthens its practical relevance. Beyond its methodological contribution, the study carries broader implications for financial practice—diffusion-based features can provide early warning signals of market stress, inform sector rotation strategies, and strengthen risk management frameworks. More generally, the findings demonstrate how incorporating model-based representations of stock prices can enrich our understanding of market structure and sectoral behavior across different economic conditions.

Author Contributions

Conceptualization, K.C.; methodology, P.P., P.K. and K.C.; software, P.P. and P.K.; validation, P.P., P.K. and K.C.; formal analysis, P.P., P.K. and K.C.; investigation, P.P., P.K. and K.C.; resources, K.C.; data curation, P.K. and K.C.; writing—original draft preparation, P.P., P.K. and K.C.; writing—review and editing, P.K. and K.C.; visualization, P.P. and P.K.; supervision, P.K. and K.C.; project administration, K.C.; funding acquisition, P.P. and P.K. All authors have read and agreed to the published version of the manuscript.

Funding

Paisit Khanarsa would like to gratefully acknowledge the Institute of Field Robotics, King Mongkut’s University of Technology Thonburi, for providing financial support toward the article processing charge of this publication.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The stock price data analyzed in this study were obtained from a publicly accessible source. The processed dataset used for analysis, together with parameter estimates and Python codes, is available from the corresponding authors upon reasonable request.

Acknowledgments

Piyarat Promsuwan was supported by the Graduate Fellowship (Research Assistant), Faculty of Science, Prince of Songkla University, Contract No. 1-2568-02-035. We are also grateful to Thomas Duncan Coyne for his invaluable assistance in refining the English language of this manuscript. We also thank the anonymous referees for their constructive comments, which have substantially improved the quality and clarity of the paper. All remaining errors are solely the responsibility of the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

LR	Log-Return
GBM	Geometric Brownian Motion
BS	Black–Scholes
SDE	Stochastic Differential Equation
SET100	Stock Exchange of Thailand 100 Index
AD	Anderson–Darling
MLE	Maximum Likelihood Estimation/Estimator
ARI	Adjusted Rand Index
FWER	Family-Wise Error Rate

References

Aitken, M., Brown, P., Buckland, C., Izan, H. Y., & Walter, T. (1996). Price clustering on the Australian stock exchange. Pacific-Basin Finance Journal, 4(2–3), 297–314. [Google Scholar] [CrossRef]
Aït-Sahalia, Y. (2002). Maximum likelihood estimation of discretely sampled diffusions: A closed-form approximation approach. Econometrica, 70(1), 223–262. [Google Scholar] [CrossRef]
Anderson, T. W., & Darling, D. A. (1952). Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. The Annals of Mathematical Statistics, 23(2), 193–212. [Google Scholar] [CrossRef]
Aslam, B., Bhuiyan, R. A., & Zhang, C. (2023). Portfolio construction with k-means clustering algorithm based on three factors. MATEC Web of Conferences, 377, 02006. [Google Scholar] [CrossRef]
Aşçıoğlu, A., Comerton-Forde, C., & McInish, T. H. (2007). Price clustering on the Tokyo stock exchange. Financial Review, 42(2), 289–301. [Google Scholar] [CrossRef]
Ball, G. H., & Hall, D. J. (1967). A clustering technique for summarizing multivariate data. Behavioral Science, 12(2), 153–155. [Google Scholar] [CrossRef] [PubMed]
Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81(3), 637–654. [Google Scholar] [CrossRef]
Brown, P., & Mitchell, J. (2008). Culture and stock price clustering: Evidence from The Peoples’ Republic of China. Pacific-Basin Finance Journal, 16(1–2), 95–120. [Google Scholar] [CrossRef]
Chai, C. (2019). Application of G-brown motion in the stock price. Journal of Mathematical Finance, 10(1), 27–34. [Google Scholar] [CrossRef]
Chumpong, K., Mekchay, K., Nualsri, F., & Sutthimat, P. (2024). Closed-form formula for the conditional moment-generating function under a regime-switching, nonlinear drift CEV Process, with applications to option pricing. Mathematics, 12(17), 2667. [Google Scholar] [CrossRef]
Chumpong, K., Mekchay, K., Rujivan, S., & Thamrongrat, N. (2022). Simple analytical formulas for pricing and hedging moment swaps. Thai Journal of Mathematics, 20(2), 693–713. [Google Scholar]
Chung, K. H., Kim, K. A., & Kitsabunnarat, P. (2005). Liquidity and quote clustering in a market with multiple tick sizes. Journal of Financial Research, 28(2), 177–195. [Google Scholar] [CrossRef]
Cox, J. C., Ingersoll, J. E., & Ross, S. (1985). A theory of the term structure of interest rates. Econometrica, 53(2), 385–407. [Google Scholar] [CrossRef]
D’Agostino, R. B., & Stephens, M. A. (1986). Goodness-of-fit-techniques. Marcel Dekker. [Google Scholar]
De Gregorio, A., & Iacus, S. M. (2010). Clustering of discretely observed diffusion processes. Computational Statistics & Data Analysis, 54(2), 598–606. [Google Scholar] [CrossRef]
Duangpan, A., Boonklurb, R., Chumpong, K., & Sutthimat, P. (2022). Analytical formulas for conditional mixed moments of generalized stochastic correlation process. Symmetry, 14(5), 897. [Google Scholar] [CrossRef]
Duffie, D., Filipović, D., & Schachermayer, W. (2003). Affine processes and applications in finance. The Annals of Applied Probability, 13(3), 984–1053. [Google Scholar] [CrossRef]
Egorov, A. V., Li, H., & Xu, Y. (2003). Maximum likelihood estimation of time-inhomogeneous diffusions. Journal of Econometrics, 114(1), 107–139. [Google Scholar] [CrossRef]
Gan, G., & Chen, K. (2016). A soft subspace clustering algorithm with log-transformed distances. Big Data and Information Analytics, 1(1), 93–109. [Google Scholar] [CrossRef]
Gramuglia, E., Storvik, G., & Stakkeland, M. (2021). Clustering and automatic labelling within time series of categorical observations—With an application to marine log messages. Journal of the Royal Statistical Society Series C: Applied Statistics, 70(3), 714–732. [Google Scholar] [CrossRef]
Grossman, S. J., Miller, M. H., Cone, K. R., Fischel, D. R., & Ross, D. J. (1997). Clustering and competition in asset markets. The Journal of Law and Economics, 40(1), 23–60. [Google Scholar] [CrossRef]
Hameed, A., & Terry, E. (1998). The effect of tick size on price clustering and trading volume. Journal of Business Finance & Accounting, 25(7–8), 849–867. [Google Scholar] [CrossRef]
Heston, S. L. (1993). A closed-form solution for options with stochastic volatility with applications to bond and currency options. The Review of Financial Studies, 6(2), 327–343. [Google Scholar] [CrossRef]
Hirukawa, J. (2006). Cluster analysis for non-Gaussian locally stationary processes. International Journal of Theoretical and Applied Finance, 9(01), 113–132. [Google Scholar] [CrossRef]
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218. [Google Scholar] [CrossRef]
Hull, J. C., & Basu, S. (2016). Options, futures, and other derivatives. Pearson Education India. [Google Scholar]
MacQueen, J. (1962). Some methods for classification and analysis of multivariate observations (pp. 281–297). Berkeley Symposium on Mathematical Statistics and Probability. [Google Scholar]
Merton, R. C. (1973). Theory of rational option pricing. The Bell Journal of Economics and Management Science, 4(1), 141–183. [Google Scholar] [CrossRef]
Nagy, L., & Ormos, M. (2018). Friendship of stock market indices: A cluster-based investigation of stock markets. Journal of Risk and Financial Management, 11(4), 88. [Google Scholar] [CrossRef]
Niederhoffer, V. (1965). Clustering of stock prices. Operations Research, 13(2), 258–265. [Google Scholar] [CrossRef]
Osborne, M. F. (1962). Periodic structure in the Brownian motion of stock prices. Operations Research, 10(3), 345–379. [Google Scholar] [CrossRef]
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850. [Google Scholar] [CrossRef]
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. [Google Scholar] [CrossRef]
Rujivan, S. (2010). Parameter estimation of the extended vasiček model. Walailak Journal of Science and Technology (WJST), 7(1), 69–73. [Google Scholar]
Rujivan, S. (2025). Analytically pricing volatility options and capped/floored volatility swaps with nonlinear payoffs in discrete observation case under the Merton jump-diffusion model driven by a nonhomogeneous Poisson process. Applied Mathematics and Computation, 486, 129029. [Google Scholar] [CrossRef]
Rujivan, S., Sutchada, A., Chumpong, K., & Rujeerapaiboon, N. (2023). Analytically computing the moments of a conic combination of independent noncentral chi-square random variables and its application for the extended Cox–Ingersoll–Ross process with time-varying dimension. Mathematics, 11(5), 1276. [Google Scholar] [CrossRef]
Rujivan, S., Thamrongrat, N., Juntanon, P., & Djehiche, B. (2025). Analytical computation of conditional moments in the extended Cox–Ingersoll–Ross process with regime switching: Hybrid PDE system solutions with financial applications. Mathematics and Computers in Simulation, 229, 176–202. [Google Scholar] [CrossRef]
Sutthimat, P., & Mekchay, K. (2022). Closed-form formulas for conditional moments of inhomogeneous Pearson diffusion processes. Communications in Nonlinear Science and Numerical Simulation, 106, 106095. [Google Scholar] [CrossRef]
Tolikas, K., & Heravi, S. (2008). The Anderson–Darling goodness-of-fit test statistic for the three-parameter lognormal distribution. Communications in Statistics—Theory and Methods, 37(19), 3135–3143. [Google Scholar] [CrossRef]
Ul-Islam, T. (2011). Normality testing—A new direction. International Journal of Business and Social Science, 2(3), 115–118. [Google Scholar]
Woolson, R. F. (2007). Wilcoxon signed-rank test. In Wiley encyclopedia of clinical trials (pp. 1–3). Wiley. [Google Scholar]

Figure 1. Silhouette scores of k-means clustering across different numbers of clusters (K) for the first half of the year 2020.

Figure 2. Silhouette scores of k-means clustering across different numbers of clusters (K) for the second half of the year 2020.
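
The silhouette scores plotted in Figures 1 and 2 can be illustrated with a minimal sketch, assuming Euclidean distance and the standard definition of Rousseeuw (1987). The function name `mean_silhouette` and the toy feature points are hypothetical and are not taken from the study's data; the k-means step that would produce the labels is omitted.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (hypothetical helper)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Toy two-dimensional feature points (assumed values, not from Table 1)
features = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(features, [0, 0, 1, 1])  # labels follow the two groups
bad = mean_silhouette(features, [0, 1, 0, 1])   # labels cut across the groups
```

A score near 1 indicates compact, well-separated clusters, while a negative score indicates that points sit closer to another cluster than to their own.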

Figure 3. Difference heatmap of silhouette scores across all months and cluster numbers.

Figure 4. Monthly average silhouette scores comparing clustering results from empirical LR parameters and BS parameters.

Figure 5. ARI of k-means clustering between log-return and Black–Scholes parameters, shown separately for the first half (January–June) and second half (July–December) of the year.

Figure 6. Monthly average ARI values summarizing overall clustering agreement between LR and BS parameters throughout the year.
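
The agreement measure shown in Figures 5 and 6 can be sketched as follows, assuming the standard pair-counting form of the Adjusted Rand Index. The function name and the label vectors are hypothetical; they only illustrate how two partitions of the same stocks would be compared.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have equal length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement
    # Pairs grouped together in both partitions, and in each one separately
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # chance-level agreement
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical cluster labels for six stocks under the two feature sets
bs_labels = [0, 0, 1, 1, 2, 2]
lr_labels = [1, 1, 0, 0, 2, 2]   # same grouping, different label names
ari_same = adjusted_rand_index(bs_labels, lr_labels)
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to relabelling, so only the grouping matters: identical groupings give 1, and values near or below 0 indicate agreement no better than chance.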

Figure 7. Comparison of centroid-nearest representatives from BS- and LR-based clustering at K = 8 in November 2020.

Table 1. Monthly descriptive statistics across all stocks for estimated parameters.

Month	Feature	Mean	Std. Dev.	Min	Max	Skewness	Kurtosis	N
January	$\hat{μ}$	−0.0781	0.1169	−0.3157	0.2337	−0.0193	−0.0128	55
	$\hat{σ}$	0.1118	0.0372	0.0424	0.2275	0.5600	0.4916
	$\bar{μ}$	−0.0040	0.0053	−0.0150	0.0056	−0.3325	−0.6091
	$\bar{σ}$	0.0250	0.0084	0.0094	0.0502	0.4777	0.1568
February	$\hat{μ}$	−0.0870	0.0867	−0.3223	0.0388	−1.0516	1.1981	38
	$\hat{σ}$	0.1300	0.0358	0.0617	0.1997	−0.0977	−0.6185
	$\bar{μ}$	−0.0049	0.0048	−0.0179	0.0021	−1.0199	1.0613
	$\bar{σ}$	0.0311	0.0086	0.0148	0.0477	−0.0964	−0.6420
March	$\hat{μ}$	−0.0796	0.1100	−0.3018	0.1313	−0.0537	0.5292	22
	$\hat{σ}$	0.2854	0.0537	0.1833	0.3701	−0.3246	−0.3458
	$\bar{μ}$	−0.0038	0.0053	−0.0144	0.0063	−0.0352	0.5008
	$\bar{σ}$	0.0631	0.0119	0.0405	0.0818	−0.3264	−0.3494
April	$\hat{μ}$	0.1882	0.1194	−0.0230	0.5336	0.3385	0.2981	49
	$\hat{σ}$	0.1460	0.0413	0.0696	0.2284	0.0847	−0.9619
	$\bar{μ}$	0.0094	0.0060	−0.0011	0.0267	0.3440	0.3186
	$\bar{σ}$	0.0331	0.0094	0.0158	0.0518	0.0849	−0.9602
May	$\hat{μ}$	0.0540	0.0693	−0.0949	0.2235	0.0638	0.0397	56
	$\hat{σ}$	0.1007	0.0317	0.0458	0.1855	0.5062	0.0825
	$\bar{μ}$	0.0032	0.0041	−0.0056	0.0131	0.0617	0.0400
	$\bar{σ}$	0.0248	0.0078	0.0113	0.0457	0.5082	0.0854
June	$\hat{μ}$	−0.0019	0.0869	−0.1428	0.3234	1.3057	2.2533	77
	$\hat{σ}$	0.1147	0.0406	0.0419	0.2458	0.6734	0.8391
	$\bar{μ}$	−0.0001	0.0043	−0.0071	0.0162	1.3102	2.2638
	$\bar{σ}$	0.0260	0.0092	0.0095	0.0557	0.6748	0.8417
July	$\hat{μ}$	−0.0192	0.0809	−0.1629	0.2407	0.8466	1.1012	70
	$\hat{σ}$	0.0970	0.0369	0.0419	0.2666	1.6862	5.5707
	$\bar{μ}$	−0.0005	0.0061	−0.0086	0.0353	3.2149	16.9343
	$\bar{σ}$	0.0230	0.0095	0.0097	0.0619	1.8333	5.1185
August	$\hat{μ}$	0.0079	0.0609	−0.1333	0.2241	0.7572	1.6808	73
	$\hat{σ}$	0.1030	0.0506	0.0376	0.3916	2.9465	13.8249
	$\bar{μ}$	0.0004	0.0032	−0.0070	0.0118	0.7644	1.7130
	$\bar{σ}$	0.0240	0.0118	0.0088	0.0910	2.9438	13.8004
September	$\hat{μ}$	−0.0535	0.0651	−0.1806	0.1076	0.2876	−0.3085	70
	$\hat{σ}$	0.0895	0.0325	0.0380	0.1923	0.6713	0.5433
	$\bar{μ}$	−0.0028	0.0034	−0.0095	0.0057	0.2830	−0.3094
	$\bar{σ}$	0.0208	0.0076	0.0088	0.0448	0.6735	0.5515
October	$\hat{μ}$	−0.0282	0.1056	−0.3071	0.2992	0.3022	1.5921	57
	$\hat{σ}$	0.0933	0.0305	0.0231	0.1695	0.2402	−0.4345
	$\bar{μ}$	−0.0015	0.0055	−0.0162	0.0158	0.3024	1.6904
	$\bar{σ}$	0.0222	0.0070	0.0054	0.0395	0.1223	−0.4024
November	$\hat{μ}$	0.1423	0.1312	−0.0551	0.4415	0.2312	−0.9068	52
	$\hat{σ}$	0.1157	0.0412	0.0593	0.2601	1.2853	2.0257
	$\bar{μ}$	0.0072	0.0065	−0.0027	0.0221	0.2086	−0.8640
	$\bar{σ}$	0.0263	0.0093	0.0134	0.0590	1.2951	2.1146
December	$\hat{μ}$	0.0033	0.0623	−0.1246	0.1498	0.3255	−0.0538	63
	$\hat{σ}$	0.1133	0.0283	0.0595	0.1857	0.3543	0.1271
	$\bar{μ}$	0.0002	0.0034	−0.0069	0.0083	0.3186	−0.0494
	$\bar{σ}$	0.0273	0.0066	0.0142	0.0444	0.3598	0.2017
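
The columns of Table 1 can be reproduced for any one feature and month with a short sketch. It assumes moment-based skewness and excess kurtosis (negative entries in the table are consistent with an excess measure) and a sample standard deviation; the exact estimators used in the study are not stated here, and the function name `describe` is hypothetical.

```python
import math

def describe(values):
    """Descriptive statistics for one feature across stocks (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with divisor n (population form, an assumption)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std": math.sqrt(m2 * n / (n - 1)),  # sample std, divisor n - 1 (assumption)
        "min": min(values),
        "max": max(values),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,      # excess kurtosis: 0 for a normal law
        "n": n,
    }

# Toy input, not data from the study
stats = describe([1.0, 2.0, 3.0, 4.0, 5.0])
```

Bias-corrected estimators would give slightly different skewness and kurtosis values for the small monthly samples reported in the N column.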

Table 2. Wilcoxon signed-rank test p-values comparing silhouette scores.

Month	Two-Sided p-Value	One-Sided p-Value (BS > LR)
January	0.1641	–
February	0.7344	–
March	0.2031	–
April	0.1289	–
May	0.7344	–
June	0.6523	–
July	0.6523	–
August	0.0078	0.0039 *
September	0.6523	–
October	0.1641	–
November	0.0039	0.0020 *
December	0.3008	–

Notes: p-values marked with * are statistically significant after Bonferroni correction.
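
The significance marks in Table 2 can be checked with a small sketch. It assumes a family-wise level of 0.05 and a Bonferroni family of 12 monthly tests; neither value is stated in the table note, so both are labeled as assumptions in the code.

```python
ALPHA = 0.05   # assumed family-wise significance level
N_TESTS = 12   # assumed family size: one test per month
threshold = ALPHA / N_TESTS  # Bonferroni-adjusted per-test level

# p-values copied from Table 2
two_sided = {
    "January": 0.1641, "February": 0.7344, "March": 0.2031, "April": 0.1289,
    "May": 0.7344, "June": 0.6523, "July": 0.6523, "August": 0.0078,
    "September": 0.6523, "October": 0.1641, "November": 0.0039, "December": 0.3008,
}
one_sided = {"August": 0.0039, "November": 0.0020}

# Months whose p-value falls below the adjusted threshold
two_sided_significant = [m for m, p in two_sided.items() if p < threshold]
one_sided_significant = [m for m, p in one_sided.items() if p < threshold]
```

Under these assumptions the adjusted threshold is about 0.0042, and both starred one-sided p-values fall below it, which agrees with the marks in the table.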

Table 3. Average cumulative returns and Sharpe ratios in December 2020 for two selected clusters from November.

Method/Cluster	Representative Stocks	R (%)	Sharpe Ratio
BS—Cluster A	CPF, ADVANC, TU, INTUCH, BCH	−3.44	−0.12
LR—Cluster A	ADVANC, TU, INTUCH, BCH, PTG	−4.38	−0.16
BS—Cluster B	GUNKUL, BCPG, TRUE, RATCH, GFPT	−1.01	−0.01
LR—Cluster B	GUNKUL, BCPG, TRUE, RATCH, BPP	−2.33	−0.04
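
The two performance measures in Table 3 can be sketched for a single price series. This assumes simple returns, a zero risk-free rate, and no annualization; the study's exact conventions are not given in the table, and the function names and prices below are hypothetical.

```python
import statistics

def cumulative_return(prices):
    """Simple cumulative return over the holding period."""
    return prices[-1] / prices[0] - 1.0

def sharpe_ratio(prices, risk_free_per_period=0.0):
    """Mean excess per-period return divided by its sample standard deviation."""
    returns = [b / a - 1.0 for a, b in zip(prices, prices[1:])]
    excess = [r - risk_free_per_period for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

# Hypothetical daily closing prices for one stock (not data from the study)
prices = [100.0, 99.0, 97.0, 96.0]
r_cum = cumulative_return(prices)   # negative for a falling series
sharpe = sharpe_ratio(prices)       # negative when the mean return is negative
```

Averaging these per-stock values over the five representatives of a cluster would give figures of the kind reported in the R (%) and Sharpe Ratio columns.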

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
