PortRSMs: Learning Regime Shifts for Portfolio Policy

Liu, Bingde; Ichise, Ryutaro

doi:10.3390/jrfm18080434


by Bingde Liu * and Ryutaro Ichise *

Department of Industrial Engineering and Economics, School of Engineering, Institute of Science Tokyo, Tokyo 152-8550, Japan

* Authors to whom correspondence should be addressed.

J. Risk Financial Manag. 2025, 18(8), 434; https://doi.org/10.3390/jrfm18080434

Submission received: 17 July 2025 / Revised: 1 August 2025 / Accepted: 2 August 2025 / Published: 5 August 2025

(This article belongs to the Special Issue Machine Learning Applications in Finance, 2nd Edition)


Abstract

This study proposes a novel Deep Reinforcement Learning (DRL) policy network structure for portfolio management called PortRSMs. PortRSMs employs stacked State-Space Models (SSMs) for the modeling of multi-scale continuous regime shifts in financial time series, striking a balance between exploring consistent distribution properties over short periods and maintaining sensitivity to sudden shocks in price sequences. PortRSMs also performs cross-asset regime fusion through hypergraph attention mechanisms, providing a more comprehensive state space for describing changes in asset correlations and co-integration. Experiments conducted on two different trading frequencies in the stock markets of the United States and Hong Kong show the superiority of PortRSMs compared to other approaches in terms of profitability, risk–return balancing, robustness, and the ability to handle sudden market shocks. Specifically, PortRSMs achieves up to a 0.03 improvement in the annual Sharpe ratio in the U.S. market, and up to a 0.12 improvement for the Hong Kong market compared to baseline methods.

Keywords: deep reinforcement learning; portfolio management; financial time series; regime shift models; state-space models

1. Introduction

High-frequency trading often relies on the modeling of the distribution of short-term asset returns. Many studies have shown the serial correlation of volatility in asset returns (LeBaron, 1992; Shiller, 1990), indicating that the distribution of asset returns may exhibit consistency over a period of time. Basic time-series models can effectively capture these properties (Bauwens et al., 2006; Bollerslev et al., 1994). However, due to structural changes, the statistical characteristics of asset price series can completely change from one period to another. For example, after policy or macroeconomic shocks, the volatility of stock prices may undergo drastic changes, leading to the failure of basic time-series models. On the other hand, Regime Shift Model (RSM) paradigms address the shortcomings in basic time-series modeling by dividing time series into different states, effectively dealing with such shocks. Therefore, short-term asset return distributions are widely modeled by RSMs (Cai, 1994; Haas et al., 2004; So et al., 1998).

Optimized in the Deep Reinforcement Learning (DRL) formulation, the portfolio policy network aims to generate an effective policy to guide high-frequency portfolio rebalancing trading strategies (Jiang & Liang, 2017). It is essential to let the portfolio policy network model the distribution of asset returns in each trading period. In previous work, modeling was often done by neural network models (Jiang & Liang, 2017; X. Li et al., 2022; Wang et al., 2021; Xu et al., 2021). These methods can effectively extract consistent distribution properties in a short period, but they do not follow the RSM modeling paradigm, making them sub-optimal in financial series modeling. Constructing a policy network within the RSM paradigm has been challenging because little previous research has expressed RSMs with deep neural networks, which DRL requires.

With the recent breakthroughs in State-Space Model (SSM) research in the field of deep learning (Gu et al., 2020, 2022; Schiff et al., 2024), we can now use neural networks to model regime shifts in financial series. This enables us to come up with new policy network designs. In our work, we use stacked SSMs to model multi-scale continuous regime shifts present in financial time series, serving as the backbone of the DRL policy network. This method excels at balancing the exploration of consistent distribution properties over short periods and sensitivity to sudden shocks in price sequences. We also perform regime fusion between different assets through hypergraph attention mechanisms (HGAMs) (X. Li et al., 2022), providing a more comprehensive state space for describing changes in asset correlations and co-integration. These features give our method better performance compared to previous methods. We call our method PortRSMs, which is the abbreviation of “Portfolio RSMs”.

Our contributions are summarized as follows:

We propose a new portfolio policy network structure with an RSM paradigm. The new structure can model regime shifts present in financial time series to strike a balance between exploring consistent distribution properties over short periods and maintaining sensitivity to sudden shocks for better portfolio decision-making.
We propose a method for cross-asset regime fusion through HGAM, providing a more comprehensive state space for describing changes in asset correlations and co-integration.
We conducted experiments on two different trading frequencies in the United States and Hong Kong stock markets. The experimental results showed the superiority of our method compared with other methods in terms of profitability, risk–return balancing, robustness, and ability to deal with sudden market shocks.

The remainder of this paper is organized as follows. Section 2 reviews related work in financial time-series modeling and deep reinforcement learning. Section 3 introduces the DRL framework for portfolio management and establishes the key mathematical notations used throughout the paper. Section 4 presents our proposed PortRSMs method, including the formulation of regime shift modeling and the regime fusion mechanism. Section 5 reports and analyzes the experimental results for multiple stock markets under different trading frequencies. Finally, Section 6 concludes the paper and discusses future research directions.

2. Related Work

Markowitz first introduced modern portfolio theory to design portfolios of assets with fixed weights using mean-variance analysis (Markowitz, 1952). The general portfolio algorithms are portfolio selection algorithms that rebalance the portfolio at the end of each trading period rather than using fixed weights. They can be roughly divided into “follow the winner” (Agarwal et al., 2006; Helmbold et al., 1998), “follow the loser” (Borodin et al., 2003; Lai et al., 2018; B. Li & Hoi, 2012; B. Li et al., 2011b, 2012), and “pattern matching” (Györfi et al., 2006; B. Li et al., 2011a). Traditional strategies are interpretable and have a solid mathematical foundation, but they achieve suboptimal results in the long run because they fail to model the complex dynamics of financial markets.

DRL approaches are a series of extremely strong “pattern matching” strategies that leverage the strong ability of deep learning for feature representation and pattern recognition. They have attracted significant attention in recent years. The EIIE methods (Jiang & Liang, 2017) first proposed a general framework to apply DRL for portfolio management, and they initially used the Temporal Convolutional Network (TCN) (Lea et al., 2017) and Long Short-Term Memory (LSTM) model (Hochreiter & Schmidhuber, 1997) as the policy network structure. The RAT (Xu et al., 2021) method first used the Transformer (Vaswani et al., 2017) structure as the policy network structure to extract complex information from financial time series. Those structures have also become commonly used in follow-up work (J. Li et al., 2023; X. Li et al., 2022; Wang et al., 2021). However, the above-mentioned structures are not derived from RSMs. We start with continuous-time RSMs, derive a method using SSMs (Gu et al., 2020, 2022; Schiff et al., 2024) as the policy network structure, and optimize it according to the specific needs of portfolio management tasks.

3. DRL for Portfolio Management

In this section, we briefly introduce the DRL method for the portfolio management problem, which provides the foundation for our work and establishes key mathematical symbols. Our formulation follows the foundational framework introduced in EIIE (Jiang & Liang, 2017) for DRL-based portfolio management, sharing all constraints.

3.1. Action

The portfolio vector represents the weights distributed to each asset. Let

w_{k} = (w_{k, 0}, w_{k, 1}, w_{k, 2}, \dots, w_{k, m}) \in R_{+}^{m + 1} [0, 1], \quad \text{s.t.} \quad \sum_{i = 0}^{m} w_{k, i} = 1

(1)

be the portfolio vector before trading period k, where $i = 0, 1, 2, \dots, m$ indexes the assets and $k \in N_{+}$ indexes the trading periods. Note that $w_{k, 0}$ indicates the weight allocated to the risk-free asset (e.g., cash). $w_{k}$ directly defines the action to take in trading period k in the framework. $w_{k}$ then transitions to $w_{k}^{'} \in R_{+}^{m + 1} [0, 1]$, i.e., the portfolio vector after period k, due to changes in the prices of the assets.

3.2. State

In the framework, the state comprises market environment features and current asset holdings. Research in investment science indicates that due to market inefficiency, asset price data over a historical period has a certain predictive power for future price changes (Bustos & Pomares-Quimbaya, 2020; Jegadeesh & Titman, 2023). Therefore, the most intuitive market features are asset price time series. Asset price time series are sampled with equal intervals, using techniques like candlestick charts. Let $t := k T \in N_{+}$ represent the sample timestamps, where $T \in N_{+}$ represents how many timestamps there are in a trading period. The portfolio vector after period $k - 1$, represented as $w_{k - 1}^{'}$, is also included in the state, since adjusting it to $w_{k}$ incurs transaction fees. In summary, the state in the framework is $s_{t} = (P_{k}, w_{k - 1}^{'})$, where

$P_{k} : = (P_{k T - l + 1}, \dots, P_{k T}) \in R_{+}^{l \times (m + 1) \times 4}$ are the $l \in N_{+}$ latest samples in the price series in trading period k;
$P_{t} : = (P_{t, 0}, \dots, P_{t, m}) \in R_{+}^{(m + 1) \times 4}$ represents the price samples of all assets at timestamp t;
$P_{t, i} : = (p_{t, i}^{O}, p_{t, i}^{H}, p_{t, i}^{L}, p_{t, i}^{C}) \in R_{+}^{4}$ indicates the opening, highest, lowest, and closing prices of asset i in the sample with timestamp t.

Initially, $w_{0}^{'} = (1, 0, 0, \dots, 0)$, i.e., the situation where all funds are in the risk-free asset.
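As an illustration of the state definition, the following minimal sketch slices the l latest samples ending at timestamp t = kT and pairs them with the previous holdings. The helper names (`initial_holdings`, `make_state`) and the list-of-tuples data layout are assumptions made for this example and are not part of the framework.

```python
def initial_holdings(m):
    """w'_0 = (1, 0, ..., 0): all funds start in the risk-free asset."""
    return [1.0] + [0.0] * m


def make_state(ohlc, k, T, l, w_prev_after):
    """Build the state (P_k, w'_{k-1}).

    ohlc[t][i] is an (open, high, low, close) tuple for asset i at timestamp t.
    Timestamps are 1-indexed as in the text, so ohlc[0] is an unused placeholder.
    """
    t = k * T
    start = t - l + 1
    if start < 1 or t >= len(ohlc):
        raise ValueError("window falls outside the available samples")
    window = ohlc[start : t + 1]  # samples kT-l+1, ..., kT
    return window, list(w_prev_after)


# Toy data (hypothetical): cash with constant price 1 and one stock, 10 timestamps.
ohlc = [None] + [
    [(1.0, 1.0, 1.0, 1.0), (10.0 + t, 11.0 + t, 9.0 + t, 10.5 + t)]
    for t in range(1, 11)
]
P_k, w_prev = make_state(ohlc, k=2, T=5, l=4, w_prev_after=initial_holdings(1))
```

Here `P_k` holds the samples at timestamps 7 to 10, and `w_prev` is the all-cash starting portfolio.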

3.3. State Transition Function

Let $r_{k, i} := \frac{p_{(k + 1) T, i}^{C}}{p_{k T, i}^{C}} \in R_{+}$ be the price change ratio of asset i in trading period k, and let $r_{k} := (r_{k, 0}, r_{k, 1}, \dots, r_{k, m}) \in R_{+}^{m + 1}$ represent the price change ratios of all assets in trading period k. The price change ratios ($r_{k}$) can be predicted from $P_{k}$; namely, there exists a probability model, i.e., $T^{'} (r_{k} | P_{k})$. Given $r_{k}$, the portfolio vector ($w_{k}$) deterministically becomes $w_{k}^{'}$. Concurrently, the new asset price series ($P_{k + 1}$) generated deterministically by the price changes becomes observable. The state transition model ($T (s_{k + 1} | s_{k}, a_{k}) := T ((P_{k + 1}, w_{k}^{'}) | (P_{k}, w_{k - 1}^{'}), w_{k})$) can then be readily obtained from the probability model ($T^{'}$) and all the aforementioned deterministic relationships.
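The deterministic part of the transition can be sketched in a few lines. The text does not spell out how the portfolio vector drifts after prices move; the sketch below assumes the value-weighted rule commonly used in the EIIE framework, where each weight is scaled by its price change ratio and the result is renormalized. The function names are hypothetical.

```python
def price_change_ratios(close_now, close_next):
    """r_{k,i} = p^C_{(k+1)T,i} / p^C_{kT,i} for every asset i."""
    return [nxt / now for now, nxt in zip(close_now, close_next)]


def drift(w, r):
    """Assumed drift rule: w'_k = (w_k * r_k) / (w_k . r_k), with an elementwise product."""
    growth = sum(wi * ri for wi, ri in zip(w, r))
    return [wi * ri / growth for wi, ri in zip(w, r)]


# Cash keeps its price (ratio 1.0); the stock rises from 10 to 12.
r_k = price_change_ratios([1.0, 10.0], [1.0, 12.0])
w_after = drift([0.5, 0.5], r_k)
```

Because the stock outperformed cash, its share of the portfolio grows above 0.5 without any trade, which is why the pre-trade holdings enter the next state.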

3.4. Reward Function

Under portfolio vector $w_{k}$, the return generated by the price changes ($r_{k}$) is rewarded. Note that the transaction fees incurred as a result of adjusting the portfolio from $w_{k - 1}^{'}$ to $w_{k}$ should be deducted from the reward. Let the transaction fee ratio be $c \in R [0, 1]$; then, the return can be calculated as follows:

r_{t} := w_{k}^{\top} r_{k} - c \| w_{k} - w_{k - 1}^{'} \|_{1},

(2)

where the first term is the initial return and the second term represents the trading costs incurred from transaction fees. The formulation of the reward function in trading period k is $R (s_{k}, a_{k}) = \log (r_{t} + 1)$, where $a_{k} = w_{k}$ and $r_{t}$ is the return from Equation (2). Here, the return is taken as the logarithm to ensure the additivity of the reward function over time.
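A small numeric sketch of Equation (2) and the log reward follows. It implements the formulas exactly as written in this section; the function names and the toy numbers are assumptions for illustration only.

```python
import math


def period_return(w, r, w_prev_after, c):
    """Equation (2): weighted price change ratio minus proportional trading costs."""
    gross = sum(wi * ri for wi, ri in zip(w, r))
    turnover = sum(abs(wi - pi) for wi, pi in zip(w, w_prev_after))  # L1 norm
    return gross - c * turnover


def reward(w, r, w_prev_after, c):
    """Log reward as stated in the text: log(return + 1)."""
    return math.log(period_return(w, r, w_prev_after, c) + 1.0)


# Move half of an all-cash portfolio into a stock that gains 10%, with c = 1%.
ret = period_return([0.5, 0.5], [1.0, 1.1], [1.0, 0.0], c=0.01)
rew = reward([0.5, 0.5], [1.0, 1.1], [1.0, 0.0], c=0.01)
```

The turnover here is 1.0 (0.5 sold plus 0.5 bought), so the fee term removes 0.01 from the weighted ratio of 1.05.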

3.5. Deterministic Policy Gradient

In deterministic policy gradient algorithms, the policy network ($π_{θ} (s)$) is a neural network that maps the current state to an action (Silver et al., 2014). The policy network is parameterized by $θ$, which refers to the trainable weights and biases of the neural network. During training, the reward function is differentiated directly, and gradient ascent is applied to update $θ$ with a learning rate of $α \in R_{+}$, according to $θ \leftarrow θ + α \nabla_{θ} R (s, π_{θ} (s))$, where $\nabla_{θ} R (s, π_{θ} (s))$ denotes the gradient of the reward function with respect to the policy parameters ($θ$).
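The update rule can be illustrated on a toy problem. In the sketch below, the "policy" is just a softmax over three free logits (it ignores the state), transaction costs are omitted, and the gradient is approximated by central finite differences as a stand-in for the automatic differentiation a real implementation would use. All names and the step size are assumptions chosen for the example.

```python
import math


def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def reward(theta, r):
    """Toy reward: log(w . r + 1) with w = softmax(theta), no trading costs."""
    w = softmax(theta)
    return math.log(sum(wi * ri for wi, ri in zip(w, r)) + 1.0)


def numeric_grad(f, theta, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at theta."""
    grad = []
    for j in range(len(theta)):
        up, down = theta[:], theta[:]
        up[j] += eps
        down[j] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def ascend(theta, r, alpha, steps):
    """Repeat theta <- theta + alpha * grad R(theta)."""
    for _ in range(steps):
        g = numeric_grad(lambda th: reward(th, r), theta)
        theta = [t + alpha * gj for t, gj in zip(theta, g)]
    return theta


r = [1.0, 1.05, 0.97]  # hypothetical price change ratios
theta0 = [0.0, 0.0, 0.0]
theta = ascend(theta0, r, alpha=0.5, steps=200)
```

After the updates, the logit of the best-performing asset is the largest, so the softmax shifts weight toward it.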

4. PortRSMs

In this section, we introduce how we establish the PortRSMs method step by step, including the mathematical form of RSMs and regime fusion, as well as the portfolio weight-generating method. Figure 1 is a data flow diagram providing an overview of PortRSMs.

4.1. SSMs

Recent research used time-series modeling with SSMs to describe regime shifts (Gu et al., 2020). For continuous-time signals, SSMs have the following form:

\begin{matrix} \begin{cases} u^{'} (t) = A u (t) + B x (t), & \text{(Transition Model)} \\ y (t) = C u (t) + D x (t), & \text{(Emission Model)} \end{cases} \\ \text{where } u (t) \in R^{h \times 1}, x (t) \in R^{d \times 1}, y (t) \in R^{d \times 1}, \\ A \in R^{h \times h}, B \in R^{h \times d}, C \in R^{d \times h}, D \in R^{d \times d} \end{matrix}

(3)

where $u (t)$ represents the hidden state with hidden dimension $h \in N_{+}$. $x (t)$ describes the observable instantaneous information at time t, while $y (t)$ is the instantaneous information that cannot be observed up to time t. Information is described with d-dimensional vectors, where $d \in N_{+}$. In addition to $A$, which describes the inherent transitions of hidden states, the transition model also uses matrix $B$ to describe how current observable instantaneous information affects hidden states. The emission model then maps $u (t)$ and $x (t)$ to the current $y (t)$ through matrices $C$ and $D$.
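To make the transition and emission models concrete, the sketch below integrates a scalar SSM (h = d = 1) with forward Euler steps. The scalar setting, the Euler scheme, and the function name are assumptions chosen for brevity. For a constant input and a zero initial state, the exact hidden state is (b x / a)(exp(a t) - 1), which gives a simple check.

```python
import math


def simulate_ssm(a, b, c, d, x_of_t, t_end, dt):
    """Forward-Euler simulation of u'(t) = a u + b x, y(t) = c u + d x."""
    u, ys = 0.0, []
    steps = int(round(t_end / dt))
    for n in range(steps):
        x = x_of_t(n * dt)
        ys.append(c * u + d * x)   # emission model
        u += dt * (a * u + b * x)  # transition model, one Euler step
    return u, ys


# Stable state (a < 0) driven by a constant input x(t) = 1.
u_end, ys = simulate_ssm(a=-1.0, b=1.0, c=1.0, d=0.0,
                         x_of_t=lambda t: 1.0, t_end=2.0, dt=1e-3)
exact = 1.0 - math.exp(-2.0)
```

With a step of 0.001, `u_end` agrees with `exact` to about four decimal places; the discretization used later in the paper (zero-order hold) removes this step-size error for piecewise-constant inputs.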

Follow-up research improved SSMs. On one hand, $A$ and $D$ are considered trainable neural network parameters (Gu et al., 2022). On the other hand, $B$ and $C$ are considered time-variant parameters ($B_{t}$ and $C_{t}$, respectively) to improve performance (Schiff et al., 2024). This makes SSMs a linear time-variant system where $B_{t}$ and $C_{t}$ are obtained through nonlinear transformation:

\begin{cases} B_{t} = δ (Γ_{B} x^{⊤} (t)), \\ C_{t} = δ (Γ_{C} x^{⊤} (t)), \end{cases} \quad \text{where } Γ_{B}, Γ_{C} \in R^{h \times d}

(4)

where $δ (\cdot)$ is a SiLU activation function and $Γ_{B}$ and $Γ_{C}$ are learnable projection matrices.

4.2. RSMs in Price Series

RSMs, based on the improvement of hidden Markov models, are widely used in modeling asset price change ratio distributions (Cai, 1994; Haas et al., 2004; So et al., 1998). They take instantaneous prices as the input information and the instantaneous price change ratios as the output.

In our work, we use SSMs to model the continuous regime shifts from the discretized price series. Therefore, RSMs are defined as follows:

\begin{matrix} x_{t, i} = δ (Γ_{P} P_{t, i}), \\ \begin{cases} u_{t, i} = {\tilde{A}}_{t, i} u_{t - 1, i} + {\tilde{B}}_{t, i} x_{t, i}, \\ y_{k, i} = δ (Γ_{C} x_{t, i}) u_{t, i} + D x_{t, i}, \end{cases} \\ \text{s.t. } t = k T, \quad {\tilde{A}}_{t, i} = \exp (Δ_{t, i} A), \quad {\tilde{B}}_{t, i} = {(Δ_{t, i} A)}^{- 1} (\exp (Δ_{t, i} A) - I) Δ_{t, i} B_{t, i}, \\ B_{t, i} = δ (Γ_{B} x_{t, i}), \quad Δ_{t, i} = δ (Γ_{Δ} x_{t, i}), \\ \text{where } Γ_{P} \in R^{d \times 4}, Γ_{Δ} \in R^{h \times d} \end{matrix}

(5)

where $Γ_{P}$ is a projection matrix that recovers the input information ($x_{t, i}$) from $P_{t, i}$, as defined in Section 3.2, while $y_{k, i}$ is a descriptor vector for the distribution characteristics of $r_{k, i}$, as defined in Section 3.3. ${\tilde{A}}_{t, i}$ and ${\tilde{B}}_{t, i}$ represent discretized versions of $A$ and $B_{t, i}$. In the case of zero-order hold sampling, the relationships between ${\tilde{A}}_{t, i}, {\tilde{B}}_{t, i}$ and $A, B_{t, i}$ have been shown in existing research (Gu et al., 2022; Schiff et al., 2024). $Δ_{t, i}$ describes the time scale that one sample interval maps to in the continuous-time perspective. Note that $Δ_{t, i}$ is also a time-variant parameter, obtained from $x_{t, i}$ through a nonlinear transformation consisting of the projection matrix $Γ_{Δ}$ and the activation function. This property further enhances its ability to model the widespread fractal properties in financial time series (Evertsz, 1995; Ni et al., 2011; Peters, 1989). The time-variant $Δ_{t, i}$ can map different dynamics in the discrete perspective to the same dynamics in the continuous perspective to model the continuous regime shifts. During training, optimization is carried out for parameters $A, D, μ_{i}, Γ_{B}, Γ_{C}, Γ_{Δ}$, and $Γ_{P}$.
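The zero-order-hold relations and one step of the recurrence can be sketched for a single scalar channel (h = d = 1). The scalar stand-ins `gamma_b`, `gamma_c`, `gamma_delta`, and `d_skip` for the projection matrices and for D, as well as the function names, are assumptions for this example; a real layer would use matrices and batched tensors.

```python
import math


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def zoh(a, b, delta):
    """Zero-order-hold discretization for a scalar state.

    a_bar = exp(delta * a)
    b_bar = (delta * a)^{-1} (exp(delta * a) - 1) * delta * b
    """
    z = delta * a
    a_bar = math.exp(z)
    ratio = math.expm1(z) / z if z != 0.0 else 1.0  # limit of (e^z - 1) / z at z = 0
    return a_bar, ratio * delta * b


def ssm_step(u_prev, x, a, gamma_b, gamma_c, gamma_delta, d_skip):
    """One recurrence step with input-dependent B, C, and step size."""
    delta = silu(gamma_delta * x)
    b = silu(gamma_b * x)
    a_bar, b_bar = zoh(a, b, delta)
    u = a_bar * u_prev + b_bar * x
    y = silu(gamma_c * x) * u + d_skip * x
    return u, y


u, y = 0.0, 0.0
for x in [1.00, 1.02, 0.99, 1.01]:  # toy inputs, assumed already projected
    u, y = ssm_step(u, x, a=-1.0, gamma_b=1.0, gamma_c=1.0,
                    gamma_delta=1.0, d_skip=0.5)
```

For a small step size, `zoh` reduces to the Euler approximation (`a_bar` close to 1 + delta a, `b_bar` close to delta b), which is a quick way to sanity-check an implementation.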

Notice that the absolute price ranges of different assets at different times may vary greatly. On the one hand, this can lead to similar dynamics being considered completely different, further causing a decrease in training data efficiency. On the other hand, it can result in uneven gradient values, making convergence difficult during training. Therefore, data normalization is important in data pre-processing. We adopt the same approach used in previous work (Jiang & Liang, 2017; Xu et al., 2021), using the latest close price in the state representation defined in Section 3.2, i.e., $p_{k T}^{C}$, as the denominator to normalize all price data in $P_{k}$. To apply this method, in each trading period (k), the SSM is re-executed on $P_{k}$, i.e., all samples of the latest l timestamps. The calculation can be performed efficiently on modern hardware by converting recursion to convolution (Schiff et al., 2024). In each calculation, $u_{k T - l, i}$ is initialized as a zero vector.
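The normalization step can be sketched as follows, assuming the window is stored as nested lists of (open, high, low, close) tuples; the layout and the function name are illustrative choices, not the authors' code.

```python
def normalize_window(window):
    """Divide every price of asset i in the window by that asset's latest close."""
    last_close = [bar[3] for bar in window[-1]]  # bar = (open, high, low, close)
    return [
        [tuple(p / last_close[i] for p in bar) for i, bar in enumerate(sample)]
        for sample in window
    ]


# Two timestamps, two assets (cash and one stock); hypothetical prices.
window = [
    [(1.0, 1.0, 1.0, 1.0), (98.0, 101.0, 97.0, 100.0)],
    [(1.0, 1.0, 1.0, 1.0), (100.0, 104.0, 99.0, 102.0)],
]
normalized = normalize_window(window)
```

After normalization the latest close of every asset equals 1, so assets trading at very different absolute price levels produce inputs on a comparable scale.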

4.3. Stacked SSMs

To further improve performance, considering the existence of multiple regime shifts of different scales in the financial time series, we use stacked SSMs to model them. The formulation of RSMs modeled by N-layer SSMs is expressed as follows:

\begin{matrix} x_{t, i}^{1} = δ (Γ_{P} P_{t, i}), \\ \begin{cases} u_{t, i}^{1} = {\tilde{A}}_{t, i}^{1} u_{t - 1, i}^{1} + {\tilde{B}}_{t, i}^{1} x_{t, i}^{1}, \\ x_{t, i}^{2} = y_{t, i}^{1} = δ (Γ_{C}^{1} x_{t, i}^{1}) u_{t, i}^{1} + D^{1} x_{t, i}^{1}, \end{cases} \\ \begin{cases} u_{t, i}^{2} = {\tilde{A}}_{t, i}^{2} u_{t - 1, i}^{2} + {\tilde{B}}_{t, i}^{2} x_{t, i}^{2}, \\ x_{t, i}^{3} = y_{t, i}^{2} = δ (Γ_{C}^{2} x_{t, i}^{2}) u_{t, i}^{2} + D^{2} x_{t, i}^{2}, \end{cases} \\ \dots \\ \begin{cases} u_{t, i}^{N} = {\tilde{A}}_{t, i}^{N} u_{t - 1, i}^{N} + {\tilde{B}}_{t, i}^{N} x_{t, i}^{N}, \\ y_{k, i}^{N} = y_{t, i}^{N} = δ (Γ_{C}^{N} x_{t, i}^{N}) u_{t, i}^{N} + D^{N} x_{t, i}^{N}, \end{cases} \\ \text{s.t. } t = k T, \quad {\tilde{A}}_{t, i}^{n} = \exp (Δ_{t, i}^{n} A^{n}), \quad {\tilde{B}}_{t, i}^{n} = {(Δ_{t, i}^{n} A^{n})}^{- 1} (\exp (Δ_{t, i}^{n} A^{n}) - I) Δ_{t, i}^{n} B_{t, i}^{n}, \\ B_{t, i}^{n} = δ (Γ_{B}^{n} x_{t, i}^{n}), \quad Δ_{t, i}^{n} = δ (Γ_{Δ}^{n} x_{t, i}^{n}), \quad n = 1, 2, \dots, N \end{matrix}

(6)

Note that parameters are not shared between layers. In this case, we take the output ($y_{t, i}^{n - 1}$) of the $(n - 1)$th SSM layer as the input ($x_{t, i}^{n}$) of the nth SSM layer. Each SSM layer can model regime shifts over the regimes modeled by the previous layer on a more abstract scale, thereby mining more stable patterns and longer dependencies.
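The stacking scheme can be sketched with scalar layers: each layer scans the full window and passes its output sequence to the next layer, and only the last layer's output at the final timestamp is kept. The dictionary-based parameter layout, the scalar simplification (h = d = 1), and the function names are assumptions for this example.

```python
import math


def silu(z):
    return z / (1.0 + math.exp(-z))


def ssm_layer(xs, p):
    """Run one scalar selective SSM layer over a sequence and return all outputs."""
    u, ys = 0.0, []  # the hidden state starts at zero for every window
    for x in xs:
        delta = silu(p["gamma_delta"] * x)
        z = delta * p["a"]
        ratio = math.expm1(z) / z if z != 0.0 else 1.0
        b_bar = ratio * delta * silu(p["gamma_b"] * x)
        u = math.exp(z) * u + b_bar * x
        ys.append(silu(p["gamma_c"] * x) * u + p["d"] * x)
    return ys


def stacked_ssm(xs, layers):
    """Feed the output sequence of layer n-1 into layer n; return the final output."""
    seq = list(xs)
    for p in layers:  # parameters are not shared between layers
        seq = ssm_layer(seq, p)
    return seq[-1]  # descriptor read at the last timestamp t = kT


layers = [
    {"a": -1.0, "gamma_b": 1.0, "gamma_c": 1.0, "gamma_delta": 1.0, "d": 0.5},
    {"a": -0.2, "gamma_b": 0.8, "gamma_c": 1.2, "gamma_delta": 0.5, "d": 0.1},
]
xs = [1.00, 1.02, 0.99, 1.01]  # toy, already-projected inputs
y_final = stacked_ssm(xs, layers)
```

Changing the length of `layers` changes the depth N without touching the rest of the code, which mirrors how the layer count is varied in the ablation study.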

4.4. Hypergraph Attention for Cross-Asset Regime Fusion

So far, we have only discussed the situation of individual assets. In the portfolio management task, multiple asset price series should be modeled simultaneously. The correlation and co-integration between these series can shift over time. Modeling these properties has been shown to be crucial for the learning of portfolio policy in existing work (X. Li et al., 2022; Shi et al., 2022; Soleymani & Paquet, 2021; Xu et al., 2021).

We include considerations of correlation and co-integration in the RSMs by utilizing HGAM (X. Li et al., 2022). The HGAM aims to learn the differing importance of asset neighbors for information merging with the attention mechanism (Vaswani et al., 2017). For stock i and each neighbor ($j = 0, 1, \dots, m$), the degree to which i is related to j is quantified based on the output ($y_{k, i}^{n}$) of the nth SSM layer, i.e.,

D (y_{k, i}^{n}, y_{k, j}^{n}) = δ (b^{n} [Γ_{R}^{n} y_{k, i}^{n}, Γ_{R}^{n} y_{k, j}^{n}]),

(7)

where $Γ_{R}^{n} \in R^{d \times 1}$ is a projection matrix to be learned, $b^{n} \in R^{1 \times d}$ is a shared attention vector, d is the model dimension defined in Section 4.2, $[\cdot]$ denotes the concatenation operation, and $δ (\cdot)$ denotes a nonlinear activation function like LeakyReLU. Then, the softmax function is applied to obtain the importance weight ($α_{i j}$):

α_{i j} = \frac{exp (D (y_{k, i}^{n}, y_{k, j}^{n}))}{\sum_{v = 0, 1, \dots, m} exp (D (y_{k, i}^{n}, y_{k, v}^{n}))} .

(8)

After that, $y_{k, i}^{n}$ is aggregated across assets as follows:

y_{k, i}^{n} \to δ (\sum_{j = 0, 1, \dots, m} α_{i j} P^{n} y_{k, j}^{n})

(9)

In the case of stacked SSMs, information from different assets is aggregated layer by layer, thereby achieving cross-asset regime fusion, providing more comprehensive state spaces for the description of changes in asset correlations and co-integration.
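Equations (7) to (9) can be sketched on small lists of descriptors. The shapes used here are assumptions chosen so that the concatenation is well defined: `proj` is a square matrix standing in for the learned projection, `b` has twice as many entries as `proj` has rows, and `agg` stands in for the matrix applied to neighbor descriptors in Equation (9). LeakyReLU is used for both activations, and all names are hypothetical.

```python
import math


def leaky_relu(z, slope=0.01):
    return z if z > 0 else slope * z


def matvec(mat, vec):
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]


def fuse(ys, proj, b, agg):
    """One round of attention-based fusion over asset descriptors ys[i]."""
    projected = [matvec(proj, y) for y in ys]
    transformed = [matvec(agg, y) for y in ys]
    fused, weights = [], []
    for p_i in projected:
        # Eq. (7): score each neighbor j from the concatenated projections.
        scores = [leaky_relu(sum(bv * v for bv, v in zip(b, p_i + p_j)))
                  for p_j in projected]
        # Eq. (8): softmax over neighbors (shifted by the max for stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        # Eq. (9): weighted sum of transformed neighbor descriptors, then activation.
        mixed = [sum(a * t[c] for a, t in zip(alpha, transformed))
                 for c in range(len(agg))]
        fused.append([leaky_relu(v) for v in mixed])
        weights.append(alpha)
    return fused, weights


eye = [[1.0, 0.0], [0.0, 1.0]]
ys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # three assets, toy descriptors
fused, weights = fuse(ys, eye, [0.3, -0.2, 0.1, 0.4], eye)
```

Each row of `weights` sums to 1, and when all descriptors are identical the weights become uniform, so the fusion leaves the shared descriptor unchanged up to the activation.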

4.5. Portfolio Generation

The last SSM-layer output ($y_{k, i}^{N} \in R^{d \times 1}$) is a descriptor of the distribution characteristics of $r_{k, i}$. $T^{'}$, as defined in Section 3.3, can be established following previous practices (Jiang & Liang, 2017; Xu et al., 2021). The formulation is expressed as follows:

\begin{matrix} w_{k} = \mathrm{Softmax} ([b, {(Γ_{w} [w_{k - 1}^{'}, y_{k}])}^{⊤}]), \\ \text{where } y_{k} = {(y_{k, 1}^{N}, y_{k, 2}^{N}, \dots, y_{k, m}^{N})}^{⊤} \in R^{d \times m}, w_{k - 1}^{'} \in R^{1 \times m}, b \in R^{1 \times 1}, Γ_{w} \in R^{(d + 1) \times 1} \end{matrix}

(10)

where $[\cdot]$ denotes the concatenation operation in the first dimension. After concatenating the distribution descriptors and portfolio vector ($w_{k - 1}^{'}$, without risk-free asset weights), the absolute score of each asset can be obtained by applying a projection matrix ($Γ_{w}$). Finally, $w_{k}$ can be generated from absolute scores by concatenating a cash bias (b) and applying a softmax function on the first dimension.
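The scoring and softmax steps of Equation (10) can be sketched as below. The flat list layout of `gamma_w` (one weight for the previous holding followed by d weights for the descriptor) and the function name are assumptions for this example.

```python
import math


def generate_portfolio(w_prev_after, descriptors, gamma_w, cash_bias):
    """w_k = softmax([b, score_1, ..., score_m]).

    w_prev_after: previous holdings of the m risky assets (cash excluded).
    descriptors:  one d-dimensional vector y_{k,i}^N per risky asset.
    gamma_w:      d + 1 projection weights (hypothetical flat layout).
    """
    scores = [
        sum(g * v for g, v in zip(gamma_w, [w_i] + list(y_i)))
        for w_i, y_i in zip(w_prev_after, descriptors)
    ]
    logits = [cash_bias] + scores  # position 0 is the risk-free asset
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


w_k = generate_portfolio(
    w_prev_after=[0.3, 0.7],
    descriptors=[[0.2, -0.1], [0.5, 0.4]],
    gamma_w=[0.1, 1.0, 1.0],
    cash_bias=0.0,
)
```

The softmax guarantees non-negative weights that sum to 1, so the output always satisfies the constraint in Equation (1); raising `cash_bias` shifts weight toward the risk-free asset.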

5. Experiments

5.1. Experimental Settings

5.1.1. Datasets

We conducted experiments with two different trading frequencies (i.e., 1 day and 1 week as the duration of the trading period) in the United States and Hong Kong stock markets. Specifically, we used the Yahoo Finance API (Yahoo, 2025) and AKShare API (King, 2019) to collect data on the Dow Jones Industrial Average (DJIA) and Hang Seng Index (HSI) constituents from 1 January 2004 to 1 January 2024. The data are sampled at daily frequency, which means $T = 1$ when the trading period is 1 day and $T = 5$ when the trading period is 1 week. Stocks with more than 70% missing data were excluded. When reporting the experimental results, we use DJIA1d/DJIA1w and HSI1d/HSI1w as the dataset names, where 1d denotes a trading period of 1 day and 1w a trading period of 1 week. For trading fees, we used an industry-standard round-trip cost of 0.06%. The training set and test set are divided chronologically in a ratio of 4:1.

5.1.2. Methods for Comparison

We use the constant rebalanced portfolio (CRP) and buy-and-hold (BAH) methods as benchmarks. CRP allocates equal funds to all assets at all times, representing average market performance. BAH allocates equal funds to all assets in the first trading period, without any further trading actions. We also compare our method with traditional portfolio management algorithms (EG (Helmbold et al., 1998), OLMAR (B. Li & Hoi, 2012), RMR (Huang et al., 2016), BNN (Györfi et al., 2006), and CORN (B. Li et al., 2011a)), as well as state-of-the-art DRL-based methods with different policy network structures (bRNN EIIE (Jiang & Liang, 2017), CNN EIIE (Jiang & Liang, 2017), RAT (Xu et al., 2021), HGAM (X. Li et al., 2022), and LSRE-CAAN (J. Li et al., 2023)).

5.1.3. Evaluation Metric

Following previous works (Jiang & Liang, 2017; Wang et al., 2021; Xu et al., 2021), we use three metrics to evaluate performance: the annualized return (AR), annualized Sharpe ratio (ASR), and annualized Calmar ratio (ACR). AR measures compound annual portfolio growth over a period, with a higher AR indicating higher profitability. ASR measures volatility-adjusted return, with a higher ASR reflecting better risk–return balancing. ACR is a risk–return balancing metric similar to ASR that measures return adjusted by the maximum drawdown. We used five different random seeds, $\{0, 64, 128, 256, 512\}$, for each experimental group to conduct repeated experiments and report the mean and standard deviation of the results to reduce their randomness and study the robustness of training. We also plot the cumulative log return over time in the testing period for qualitative analysis.
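The three metrics can be computed from a series of per-period net returns as sketched below. The paper does not state its exact formulas, so the definitions here are common textbook conventions and should be read as assumptions: a zero risk-free rate, the sample standard deviation, and a caller-supplied number of periods per year (for example 252 for daily and 52 for weekly trading).

```python
import math
import statistics


def annualized_return(returns, periods_per_year):
    """Compound growth rescaled to one year."""
    wealth = math.prod(1.0 + r for r in returns)
    return wealth ** (periods_per_year / len(returns)) - 1.0


def annualized_sharpe(returns, periods_per_year):
    """Mean over sample standard deviation, scaled by sqrt(periods per year)."""
    return (statistics.mean(returns) / statistics.stdev(returns)
            * math.sqrt(periods_per_year))


def max_drawdown(returns):
    """Largest peak-to-trough loss of the wealth curve, as a fraction of the peak."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, (peak - wealth) / peak)
    return worst


def annualized_calmar(returns, periods_per_year):
    """Annualized return divided by the maximum drawdown."""
    return annualized_return(returns, periods_per_year) / max_drawdown(returns)


returns = [0.01, -0.02, 0.03, 0.01]  # hypothetical per-period net returns
ar = annualized_return(returns, 252)
asr = annualized_sharpe(returns, 252)
acr = annualized_calmar(returns, 252)
```

`annualized_calmar` is undefined when the wealth curve never falls (zero drawdown), so a practical implementation would need to handle that case explicitly.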

5.1.4. Implementation Details

Training was performed using the Adam optimizer on a single NVIDIA RTX A6000 GPU, with the learning rate set to $α = 1.92 \times 10^{- 5}$ and a batch size of 32. We used a model dimension of $d = 36$ and hidden dimension of $h = 16$. By default, we use an SSM layer number of $N = 2$ and sample count of $l = 50$. Cross-asset regime fusion is also applied by default.

5.2. Ablation Studies

5.2.1. Modeling Paradigm

We conducted experiments by replacing the SSM module with LSTM (Hochreiter & Schmidhuber, 1997), TCN (Lea et al., 2017), and Transformer (Vaswani et al., 2017) decoder modules with the same model dimension. As for the quantitative results, we present the AR in Table 1, ASR in Table 2, and ACR in Table 3. SSM achieves the best performance on most datasets for most values of the time-series sample count (l).

Insights into the different modules can be gained from how their performance changes with l and how stable training is across random seeds. When using the LSTM module, although the algorithm’s performance remains stable for most values of l, it cannot surpass that of our SSMs because its modeling paradigm is sub-optimal relative to RSMs. TCN shows a larger variance in performance metrics compared to other modules when trained with different random seeds, likely due to the assignment of equal weights to each sample during learning, which can lead to overfitting on noise. The Transformer’s performance is highly unstable as l varies, especially on the HSI dataset, where its performance sharply declines as l increases, because the attention mechanism tends to focus on over-long dependencies during training, which reduces sensitivity toward sudden shocks.

With the RSM paradigm, the SSM module exhibits better stability as l changes and lower variance when trained with different random seeds compared to other models, indicating a balance between the extraction of consistent distribution properties and sensitivity towards sudden shocks.

Figure 2 and Figure 3 show a performance comparison of different modules from a qualitative analysis perspective. The SSM module has a stable and rapidly growing return curve in most periods compared with other modules. Specifically, as shown in Figure 3, during the COVID outbreak period, when overall market prices were falling rapidly, the SSM module did not incur excessive losses, and during periods of price rebounding, the SSM module quickly identified profit opportunities in the market, resulting in higher returns than other methods.

5.2.2. Stacked SSMs

In the ablation experiment, we analyze the role of each component in the proposed algorithm. First, we compare the effect of the SSM module with the RSM paradigm and other existing time-series modeling modules with other paradigms by replacing each layer with other modules with the same model dimension. Secondly, we study the effect of stacked SSMs by changing the number of SSM layers. Finally, we compare the performance of algorithms with and without cross-asset regime fusion to illustrate its effectiveness.

We study the relationship between different numbers of SSM layers and performance when the time-series sample count (l) varies. As for the performance metrics, we present the AR in Table 4, ASR in Table 5, and ACR in Table 6. Note that on the DJIA1d and HSI1d datasets, the choice of layer number is independent of l. This indicates that in high-frequency trading, it is only necessary to model very short-term regime shifts, so increasing l does not lead to significant changes in regime-shift information. In contrast, on the DJIA1w dataset, increasing l results in richer regime-shift information for portfolio policy generation. Therefore, using more layers for RSM modeling results in better performance. Finally, note that on the HSI1w dataset, although performance improves with three layers compared to using two SSM layers as l increases, better performance is achieved with just one layer instead. This is because the HSI1w dataset exhibits a clear mean-reversion property that can be modeled effectively with only one SSM layer. The mean-reversion nature of HSI1w is also shown in the results in Section 5.3.

5.2.3. Cross-Asset Regime Fusion

We study the relationship between performance with and without cross-asset regime fusion when the time-series sample count (l) varies. As for the performance metrics, we present the AR in Table 7, ASR in Table 8, and ACR in Table 9. In 1d high-frequency trading, cross-asset regime fusion can consistently improve performance, indicating that asset correlation and co-integration are particularly important for the generation of high-frequency portfolio policies. However, in 1w low-frequency trading scenarios, whether cross-asset regime fusion can improve performance is closely related to l. In both DJIA and HSI, although cross-asset regime fusion can enhance performance at small values of l, it may lead to a decrease in performance as l increases. This suggests that the correlation and co-integration of assets in DJIA and HSI are relatively unstable, so that unstable relationships within large samples interfere with learning and decrease performance.

5.3. Comparison with Other Methods

We compare our method’s performance to that of other methods, showing the AR in Table 10, ASR in Table 11, and ACR in Table 12. Overall, DRL methods generally outperform traditional methods. However, on HSI1w, mean reversion-based methods such as OLMAR (B. Li & Hoi, 2012) and RMR (Huang et al., 2016) perform better than deep learning methods, indicating a clear mean reversion characteristic on HSI1w. The performance of previous DRL methods depends on the dataset. CNN EIIE (Jiang & Liang, 2017) achieves good results on the DJIA dataset. The LSRE-CAAN (J. Li et al., 2023) method performs well on the DJIA dataset. RAT (Xu et al., 2021) handles HSI datasets well. Compared with other methods, our PortRSMs achieves the best performance among DRL methods on almost all datasets, showing its effectiveness and robustness.

6. Conclusions

This paper presents an innovative portfolio policy network structure that effectively addresses the challenges in financial time-series modeling by combining the regime shift model (RSM) paradigm with recent advancements in deep learning techniques. The experimental results across multiple stock markets and trading frequencies confirm the superiority of the proposed approach. This paper offers new insights for the development of high-frequency trading strategies and contributes valuable perspectives to the field of financial time-series modeling.

However, we acknowledge certain limitations of the current study. Due to the varying characteristics of different financial markets and time periods, some architecture hyper-parameters of the proposed model need to be adjusted for different scenarios. This sensitivity limits the model’s direct generalization across markets.

In addition, our approach implicitly assumes that short-term asset return distributions are locally consistent within regimes and that regime shifts can be effectively captured by structural patterns in the data. This assumption is supported by prior work on volatility clustering and regime shift models (RSMs) but may not hold under all market conditions, such as in cases of highly irregular or non-stationary shocks.

To address this, future research could explore the integration of meta-learning or online learning techniques, which are capable of adapting to dynamic and heterogeneous market environments in a more automated and robust manner.

Future research could also explore the application of the proposed method in other financial markets and asset classes, as well as further optimize the model to adapt to more complex market environments.

Author Contributions

Conceptualization, B.L.; methodology, B.L.; software, B.L.; validation, B.L. and R.I.; formal analysis, B.L. and R.I.; investigation, B.L.; resources, R.I.; data curation, B.L.; writing—original draft preparation, B.L.; writing—review and editing, R.I.; visualization, B.L.; supervision, R.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Institute of Science Tokyo.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Notes

1	The SiLU (Sigmoid Linear Unit) function is defined as $\mathrm{SiLU}(x) = x \cdot \mathrm{sigmoid}(x)$ and is known for being smooth and non-monotonic.
2	LeakyReLU is defined as $f(x) = x$ for $x > 0$ and $f(x) = \alpha x$ for $x \leq 0$, where $\alpha$ is a small constant (e.g., 0.01), allowing a small gradient when the input is negative.
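
The two activation functions defined in these notes can be written as scalar Python functions, as in the minimal sketch below. It is an illustration of the definitions only, not the implementation used in the experiments; the function names and the default $\alpha = 0.01$ are taken from the notes as examples.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """f(x) = x for x > 0, and alpha * x otherwise."""
    return x if x > 0 else alpha * x


# SiLU is non-monotonic: it dips below zero for negative inputs and
# then rises back toward zero as x becomes more negative.
print(silu(-3.0), silu(-1.0), silu(0.0))
# LeakyReLU keeps a small slope (alpha) for negative inputs.
print(leaky_relu(-2.0), leaky_relu(2.0))
```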

References

Agarwal, A., Hazan, E., Kale, S., & Schapire, R. E. (2006, June 25–29). Algorithms for portfolio management based on the Newton method. 23rd International Conference on Machine Learning (pp. 9–16), Pittsburgh, PA, USA.
Bauwens, L., Laurent, S., & Rombouts, J. V. (2006). Multivariate GARCH models: A survey. Journal of Applied Econometrics, 21(1), 79–109.
Bollerslev, T., Engle, R. F., & Nelson, D. B. (1994). Chapter 49 ARCH models. In Handbook of econometrics (Vol. 4, pp. 2959–3038). Elsevier.
Borodin, A., El-Yaniv, R., & Gogan, V. (2003, December 8–13). Can we learn to beat the best stock. Advances in Neural Information Processing Systems (Vol. 16), Vancouver, BC, Canada.
Bustos, O., & Pomares-Quimbaya, A. (2020). Stock market movement forecast: A systematic review. Expert Systems with Applications, 156, 113464.
Cai, J. (1994). A Markov model of switching-regime ARCH. Journal of Business & Economic Statistics, 12(3), 309–316.
Evertsz, C. J. (1995). Fractal geometry of financial time series. Fractals, 3(3), 609–616.
Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. (2020, December 6–12). HiPPO: Recurrent memory with optimal polynomial projections. 34th International Conference on Neural Information Processing Systems (Vol. 33, pp. 1474–1487), Vancouver, BC, Canada.
Gu, A., Goel, K., & Re, C. (2022, April 25–29). Efficiently modeling long sequences with structured state spaces. International Conference on Learning Representations, Virtual.
Györfi, L., Lugosi, G., & Udina, F. (2006). Nonparametric kernel-based sequential investment strategies. Mathematical Finance, 16(2), 337–357.
Haas, M., Mittnik, S., & Paolella, M. S. (2004). A new approach to Markov-switching GARCH models. Journal of Financial Econometrics, 2(4), 493–530.
Helmbold, D. P., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1998). On-line portfolio selection using multiplicative updates. Mathematical Finance, 8(4), 325–347.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Huang, D., Zhou, J., Li, B., Hoi, S. C., & Zhou, S. (2016). Robust median reversion strategy for online portfolio selection. IEEE Transactions on Knowledge and Data Engineering, 28(9), 2480–2493.
Jegadeesh, N., & Titman, S. (2023). Momentum: Evidence and insights 30 years later. Pacific-Basin Finance Journal, 82, 102202.
Jiang, Z., & Liang, J. (2017, September 7–8). Cryptocurrency portfolio management with deep reinforcement learning. 2017 Intelligent Systems Conference (pp. 905–913), London, UK.
King, A. (2019). Akshare. Available online: https://github.com/akfamily/akshare (accessed on 31 July 2025).
Lai, Z.-R., Yang, P.-Y., Fang, L., & Wu, X. (2018). Reweighted price relative tracking system for automatic portfolio optimization. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 50(11), 4349–4361.
Lea, C., Flynn, M. D., Vidal, R., Reiter, A., & Hager, G. D. (2017, July 21–26). Temporal convolutional networks for action segmentation and detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1003–1012), Honolulu, HI, USA.
LeBaron, B. (1992). Some relations between volatility and serial correlations in stock market returns. The Journal of Business, 65(2), 199–219.
Li, B., & Hoi, S. C. (2012, June 26–July 1). On-line portfolio selection with moving average reversion. 29th International Conference on Machine Learning (pp. 273–280), Edinburgh, Scotland.
Li, B., Hoi, S. C., & Gopalkrishnan, V. (2011a). CORN: Correlation-driven nonparametric learning approach for portfolio selection. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–29.
Li, B., Hoi, S. C., Zhao, P., & Gopalkrishnan, V. (2011b, April 11–13). Confidence weighted mean reversion strategy for on-line portfolio selection. 14th International Conference on Artificial Intelligence and Statistics (Vol. 15, pp. 434–442), Fort Lauderdale, FL, USA.
Li, B., Zhao, P., Hoi, S. C., & Gopalkrishnan, V. (2012). PAMR: Passive aggressive mean reversion strategy for portfolio selection. Machine Learning, 87(2), 221–258.
Li, J., Zhang, Y., Yang, X., & Chen, L. (2023). Online portfolio management via deep reinforcement learning with high-frequency data. Information Processing & Management, 60(3), 103247.
Li, X., Cui, C., Cao, D., Du, J., & Zhang, C. (2022, May 23–27). Hypergraph-based reinforcement learning for stock portfolio selection. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4028–4032), Singapore.
Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1), 77–91.
Ni, L.-P., Ni, Z.-W., & Gao, Y.-Z. (2011). Stock trend prediction based on fractal feature selection and support vector machine. Expert Systems with Applications, 38(5), 5569–5576.
Peters, E. E. (1989). Fractal structure in the capital markets. Financial Analysts Journal, 45(4), 32–37.
Schiff, Y., Kao, C.-H., Gokaslan, A., Dao, T., Gu, A., & Kuleshov, V. (2024, July 21–27). Caduceus: Bi-directional equivariant long-range DNA sequence modeling. 41st International Conference on Machine Learning (Vol. 235, pp. 43632–43648), Vienna, Austria.
Shi, S., Li, J., Li, G., Pan, P., Chen, Q., & Sun, Q. (2022). GPM: A graph convolutional network based reinforcement learning framework for portfolio management. Neurocomputing, 498, 14–27.
Shiller, R. J. (1990). Market volatility and investor behavior. The American Economic Review, 80(2), 58–62.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014, June 21–26). Deterministic policy gradient algorithms. 31st International Conference on Machine Learning (pp. 387–395), Beijing, China.
So, M. K. P., Lam, K., & Li, W. K. (1998). A stochastic volatility model with Markov switching. Journal of Business & Economic Statistics, 16(2), 244–253.
Soleymani, F., & Paquet, E. (2021). Deep graph convolutional reinforcement learning for financial portfolio management—DeepPocket. Expert Systems with Applications, 182, 115127.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017, December 4–9). Attention is all you need. International Conference on Neural Information Processing Systems (pp. 6000–6010), Long Beach, CA, USA.
Wang, Z., Huang, B., Tu, S., Zhang, K., & Xu, L. (2021, February 2–9). DeepTrader: A deep reinforcement learning approach for risk-return balanced portfolio management with market conditions embedding. AAAI Conference on Artificial Intelligence (pp. 643–650), Virtual.
Xu, K., Zhang, Y., Ye, D., Zhao, P., & Tan, M. (2021, January 7–15). Relation-aware transformer for portfolio policy learning. 29th International Conference on International Joint Conferences on Artificial Intelligence (pp. 4647–4653), Yokohama, Japan.
Yahoo. (2025). Yahoo finance—Stock market live, quotes, business & finance news. Available online: https://finance.yahoo.com/ (accessed on 31 July 2025).

Figure 1. Data flow diagram of PortRSMs at timestamp t and trading period k ($t = kT$, as defined in Section 3.2). PortRSMs uses sample $P$ in the asset price time series as input. To model RSMs, SSMs are used to simulate the transition of state u. The output (y) of each SSM layer is used as the input of the next layer, and the output of each layer is fused across assets through the HGAM. The output of the last layer generates new portfolio weights ($w_{k}$) through the output projection and softmax functions, after concatenating with the existing portfolio weights ($w_{k-1}^{'}$).

Figure 2. The cumulative log return in the testing period when using different modules, with $l = 50$. The result for CRP is also shown for comparison as a benchmark.

Figure 3. The cumulative log return during the COVID outbreak period for the DJIA dataset when using different modules, with $l = 50$. The result for CRP is also shown for comparison as a benchmark.

Table 1. AR performance comparison of different modules across various datasets, using different time-series sample counts (l). Bold values indicate the best result in each group of comparisons, and underlined values indicate the second-best result.

Dataset	Module	$l = 20$	$l = 50$	$l = 100$
DJIA1d	LSTM	0.104 ± 0.003	0.105 ± 0.004	0.104 ± 0.003
	TCN	0.099 ± 0.011	0.046 ± 0.036	0.093 ± 0.054
	Transformer	0.104 ± 0.001	0.109 ± 0.003	0.119 ± 0.013
	SSMs	0.109 ± 0.008	0.121 ± 0.012	0.127 ± 0.013
DJIA1w	LSTM	0.124 ± 0.006	0.127 ± 0.008	0.131 ± 0.009
	TCN	0.158 ± 0.022	0.171 ± 0.035	0.298 ± 0.078
	Transformer	0.132 ± 0.010	0.123 ± 0.003	0.156 ± 0.029
	SSMs	0.161 ± 0.015	0.170 ± 0.026	0.157 ± 0.017
HSI1d	LSTM	0.218 ± 0.034	0.175 ± 0.036	0.201 ± 0.085
	TCN	0.236 ± 0.020	0.211 ± 0.054	0.160 ± 0.086
	Transformer	0.232 ± 0.034	0.179 ± 0.020	0.112 ± 0.052
	SSMs	0.296 ± 0.088	0.220 ± 0.029	0.247 ± 0.122
HSI1w	LSTM	0.225 ± 0.012	0.231 ± 0.015	0.233 ± 0.018
	TCN	0.215 ± 0.047	0.245 ± 0.048	0.183 ± 0.087
	Transformer	0.251 ± 0.023	0.180 ± 0.071	0.145 ± 0.064
	SSMs	0.292 ± 0.048	0.279 ± 0.048	0.251 ± 0.076

Table 2. ASR performance comparison of different modules across various datasets, using different time-series sample counts (l). Bold values indicate the best result in each group of comparisons, and underlined values indicate the second-best result.

Dataset	Module	$l = 20$	$l = 50$	$l = 100$
DJIA1d	LSTM	0.558 ± 0.006	0.559 ± 0.010	0.557 ± 0.007
	TCN	0.506 ± 0.067	0.394 ± 0.136	0.532 ± 0.077
	Transformer	0.558 ± 0.002	0.565 ± 0.004	0.586 ± 0.028
	SSMs	0.573 ± 0.025	0.613 ± 0.036	0.630 ± 0.040
DJIA1w	LSTM	0.621 ± 0.011	0.619 ± 0.008	0.616 ± 0.008
	TCN	0.532 ± 0.081	0.514 ± 0.055	0.681 ± 0.099
	Transformer	0.623 ± 0.003	0.617 ± 0.005	0.642 ± 0.019
	SSMs	0.646 ± 0.024	0.662 ± 0.018	0.665 ± 0.014
HSI1d	LSTM	0.695 ± 0.066	0.635 ± 0.051	0.676 ± 0.121
	TCN	0.724 ± 0.044	0.652 ± 0.088	0.565 ± 0.179
	Transformer	0.776 ± 0.040	0.710 ± 0.037	0.503 ± 0.155
	SSMs	0.808 ± 0.142	0.768 ± 0.040	0.792 ± 0.147
HSI1w	LSTM	0.756 ± 0.013	0.779 ± 0.027	0.785 ± 0.027
	TCN	0.718 ± 0.128	0.714 ± 0.062	0.613 ± 0.214
	Transformer	0.854 ± 0.049	0.679 ± 0.193	0.537 ± 0.136
	SSMs	0.868 ± 0.060	0.899 ± 0.065	0.793 ± 0.113

Table 3. ACR performance comparison of different modules across various datasets, using different time-series sample counts (l). Bold values indicate the best result in each group of comparisons, and underlined values indicate the second-best result.

Dataset	Module	$l = 20$	$l = 50$	$l = 100$
DJIA1d	LSTM	0.307 ± 0.004	0.307 ± 0.006	0.306 ± 0.004
	TCN	0.235 ± 0.073	0.164 ± 0.122	0.325 ± 0.119
	Transformer	0.305 ± 0.001	0.309 ± 0.002	0.321 ± 0.015
	SSMs	0.313 ± 0.015	0.345 ± 0.025	0.354 ± 0.029
DJIA1w	LSTM	0.387 ± 0.017	0.388 ± 0.014	0.385 ± 0.066
	TCN	0.364 ± 0.139	0.393 ± 0.116	0.724 ± 0.279
	Transformer	0.405 ± 0.011	0.399 ± 0.007	0.473 ± 0.079
	SSMs	0.499 ± 0.075	0.555 ± 0.101	0.515 ± 0.066
HSI1d	LSTM	0.567 ± 0.102	0.488 ± 0.058	0.560 ± 0.182
	TCN	0.590 ± 0.08	0.488 ± 0.113	0.431 ± 0.210
	Transformer	0.661 ± 0.069	0.604 ± 0.060	0.354 ± 0.181
	SSMs	0.728 ± 0.217	0.688 ± 0.068	0.762 ± 0.245
HSI1w	LSTM	0.804 ± 0.053	0.884 ± 0.058	0.881 ± 0.060
	TCN	0.676 ± 0.262	0.688 ± 0.131	0.478 ± 0.225
	Transformer	0.979 ± 0.112	0.588 ± 0.252	0.346 ± 0.158
	SSMs	1.189 ± 0.142	1.282 ± 0.182	0.830 ± 0.160

Table 4. Performance comparison of different numbers of stacked SSM layers across various datasets, using different time-series sample counts (l), as evaluated by AR. Bold values indicate the best result in each group of comparisons, and underlined values indicate the second-best result.

Dataset	Layer Num	$l = 20$	$l = 50$	$l = 100$
DJIA1d	1	0.100 ± 0.010	0.110 ± 0.014	0.114 ± 0.011
	2	0.109 ± 0.008	0.121 ± 0.012	0.127 ± 0.013
	3	0.107 ± 0.005	0.116 ± 0.015	0.125 ± 0.019
DJIA1w	1	0.140 ± 0.003	0.156 ± 0.006	0.164 ± 0.008
	2	0.161 ± 0.015	0.170 ± 0.026	0.157 ± 0.017
	3	0.142 ± 0.030	0.187 ± 0.052	0.190 ± 0.048
HSI1d	1	0.273 ± 0.060	0.186 ± 0.022	0.181 ± 0.022
	2	0.296 ± 0.088	0.220 ± 0.029	0.247 ± 0.122
	3	0.355 ± 0.096	0.262 ± 0.082	0.317 ± 0.164
HSI1w	1	0.284 ± 0.007	0.341 ± 0.020	0.315 ± 0.031
	2	0.292 ± 0.048	0.279 ± 0.048	0.251 ± 0.076
	3	0.282 ± 0.040	0.294 ± 0.071	0.272 ± 0.029

Table 5. Performance comparison of different numbers of stacked SSM layers across various datasets, using different time-series sample counts (l), as evaluated by ASR. Bold values indicate the best result in each group of comparisons, and underlined values indicate the second-best result.

Dataset	Layer Num	$l = 20$	$l = 50$	$l = 100$
DJIA1d	1	0.559 ± 0.004	0.582 ± 0.014	0.589 ± 0.020
	2	0.573 ± 0.025	0.613 ± 0.036	0.630 ± 0.040
	3	0.569 ± 0.014	0.596 ± 0.046	0.614 ± 0.054
DJIA1w	1	0.635 ± 0.008	0.654 ± 0.006	0.661 ± 0.006
	2	0.646 ± 0.024	0.662 ± 0.018	0.665 ± 0.014
	3	0.639 ± 0.037	0.658 ± 0.026	0.670 ± 0.023
HSI1d	1	0.763 ± 0.093	0.713 ± 0.043	0.685 ± 0.036
	2	0.808 ± 0.142	0.768 ± 0.040	0.792 ± 0.147
	3	0.948 ± 0.141	0.818 ± 0.105	0.880 ± 0.213
HSI1w	1	0.859 ± 0.012	0.950 ± 0.033	0.886 ± 0.041
	2	0.868 ± 0.060	0.899 ± 0.065	0.793 ± 0.113
	3	0.856 ± 0.069	0.905 ± 0.052	0.827 ± 0.041

Table 6. Performance comparison of different numbers of stacked SSM layers across various datasets, using different time-series sample counts (l), as evaluated by ACR. Bold values indicate the best result in each group of comparisons, and underlined values indicate the second-best result.

Dataset	Layer Num	$l = 20$	$l = 50$	$l = 100$
DJIA1d	1	0.309 ± 0.004	0.310 ± 0.012	0.316 ± 0.010
	2	0.313 ± 0.015	0.345 ± 0.025	0.354 ± 0.029
	3	0.314 ± 0.006	0.336 ± 0.032	0.343 ± 0.041
DJIA1w	1	0.428 ± 0.016	0.483 ± 0.014	0.508 ± 0.026
	2	0.499 ± 0.075	0.555 ± 0.101	0.515 ± 0.066
	3	0.476 ± 0.089	0.604 ± 0.190	0.611 ± 0.169
HSI1d	1	0.633 ± 0.128	0.603 ± 0.080	0.553 ± 0.059
	2	0.728 ± 0.217	0.688 ± 0.068	0.762 ± 0.245
	3	0.914 ± 0.210	0.780 ± 0.186	0.908 ± 0.306
HSI1w	1	1.159 ± 0.047	1.479 ± 0.088	1.206 ± 0.178
	2	1.189 ± 0.142	1.282 ± 0.182	0.830 ± 0.178
	3	1.209 ± 0.220	1.291 ± 0.278	0.979 ± 0.200

Table 7. Performance comparison (AR) of algorithms with and without cross-asset regime fusion across various datasets and time-series sample counts (l). Bold values indicate the best result in each comparison group. (✔) means with cross-asset regime fusion, and (×) means without.

Dataset	Fusion	$l = 20$	$l = 50$	$l = 100$
DJIA1d	×	0.100 ± 0.004	0.103 ± 0.007	0.112 ± 0.005
DJIA1d	✔	0.109 ± 0.008	0.121 ± 0.012	0.127 ± 0.013
DJIA1w	×	0.145 ± 0.006	0.158 ± 0.003	0.166 ± 0.006
DJIA1w	✔	0.161 ± 0.015	0.170 ± 0.026	0.157 ± 0.017
HSI1d	×	0.192 ± 0.033	0.171 ± 0.008	0.171 ± 0.006
HSI1d	✔	0.296 ± 0.088	0.220 ± 0.029	0.247 ± 0.122
HSI1w	×	0.244 ± 0.015	0.305 ± 0.015	0.257 ± 0.016
HSI1w	✔	0.292 ± 0.048	0.279 ± 0.048	0.251 ± 0.076

Table 8. Performance comparison (ASR) of algorithms with and without cross-asset regime fusion across various datasets and time-series sample counts (l). Bold values indicate the best result in each comparison group. (✔) means with cross-asset regime fusion, and (×) means without.

Dataset	Fusion	$l = 20$	$l = 50$	$l = 100$
DJIA1d	×	0.557 ± 0.002	0.563 ± 0.010	0.578 ± 0.013
DJIA1d	✔	0.573 ± 0.025	0.613 ± 0.036	0.630 ± 0.040
DJIA1w	×	0.642 ± 0.006	0.661 ± 0.004	0.668 ± 0.004
DJIA1w	✔	0.646 ± 0.024	0.662 ± 0.018	0.665 ± 0.014
HSI1d	×	0.647 ± 0.041	0.678 ± 0.013	0.655 ± 0.015
HSI1d	✔	0.808 ± 0.142	0.768 ± 0.040	0.792 ± 0.147
HSI1w	×	0.816 ± 0.019	0.909 ± 0.019	0.836 ± 0.026
HSI1w	✔	0.868 ± 0.060	0.899 ± 0.065	0.793 ± 0.113

Table 9. Performance comparison (ACR) of algorithms with and without cross-asset regime fusion across various datasets and time-series sample counts (l). Bold values indicate the best result in each comparison group. (✔) means with cross-asset regime fusion, and (×) means without.

Dataset	Fusion	$l = 20$	$l = 50$	$l = 100$
DJIA1d	×	0.308 ± 0.001	0.307 ± 0.002	0.308 ± 0.007
DJIA1d	✔	0.313 ± 0.015	0.345 ± 0.025	0.354 ± 0.029
DJIA1w	×	0.446 ± 0.019	0.474 ± 0.049	0.524 ± 0.019
DJIA1w	✔	0.499 ± 0.075	0.555 ± 0.101	0.515 ± 0.066
HSI1d	×	0.485 ± 0.063	0.557 ± 0.024	0.521 ± 0.024
HSI1d	✔	0.728 ± 0.217	0.688 ± 0.068	0.762 ± 0.245
HSI1w	×	1.006 ± 0.044	1.311 ± 0.062	0.963 ± 0.086
HSI1w	✔	1.189 ± 0.142	1.282 ± 0.182	0.830 ± 0.160

Table 10. Performance comparison (AR) of different methods across various datasets. The best and second best results are marked by bold and underlined values, respectively.

Method	DJIA1d	DJIA1w	HSI1d	HSI1w
CRP	0.104	0.104	0.021	0.012
BAH	0.093	0.092	0.014	0.007
EG	0.104	0.104	0.020	0.012
OLMAR	−0.102	−0.042	−0.240	0.316
RMR	−0.211	0.024	−0.229	0.508
BNN	0.079	−0.171	−0.095	−0.171
CORN	−0.075	−0.069	−0.184	−0.314
CNN EIIE	0.114 ± 0.012	0.138 ± 0.028	0.082 ± 0.065	0.130 ± 0.094
bRNN EIIE	0.097 ± 0.006	0.122 ± 0.003	0.043 ± 0.013	0.107 ± 0.040
RAT	0.103 ± 0.002	0.124 ± 0.003	0.149 ± 0.013	0.215 ± 0.045
HGAM	0.103 ± 0.000	0.170 ± 0.056	0.078 ± 0.061	0.241 ± 0.110
LSRE-CAAN	0.119 ± 0.000	0.102 ± 0.029	0.011 ± 0.007	0.009 ± 0.000
PortRSMs	0.121 ± 0.012	0.170 ± 0.026	0.220 ± 0.029	0.279 ± 0.048

Table 11. Performance comparison (ASR) of different methods across various datasets. The best and second best results are marked by bold and underlined values, respectively.

Method	DJIA1d	DJIA1w	HSI1d	HSI1w
CRP	0.561	0.595	0.204	0.156
BAH	0.520	0.551	0.171	0.134
EG	0.559	0.593	0.202	0.154
OLMAR	−0.034	0.177	−0.262	0.815
RMR	−0.353	0.297	−0.225	1.116
BNN	0.388	−0.328	0.059	−0.216
CORN	−0.092	−0.108	−0.185	−0.542
CNN EIIE	0.584 ± 0.022	0.635 ± 0.032	0.398 ± 0.193	0.523 ± 0.287
bRNN EIIE	0.557 ± 0.001	0.625 ± 0.005	0.308 ± 0.056	0.516 ± 0.081
RAT	0.560 ± 0.001	0.620 ± 0.002	0.644 ± 0.041	0.805 ± 0.102
HGAM	0.559 ± 0.000	0.669 ± 0.034	0.428 ± 0.216	0.814 ± 0.176
LSRE-CAAN	0.564 ± 0.000	0.508 ± 0.094	0.182 ± 0.004	0.149 ± 0.000
PortRSMs	0.613 ± 0.036	0.662 ± 0.018	0.768 ± 0.040	0.899 ± 0.065

Table 12. Performance comparison (ACR) of different methods across various datasets. The best and second best results are marked by bold and underlined values, respectively.

Method	DJIA1d	DJIA1w	HSI1d	HSI1w
CRP	0.310	0.332	0.067	0.043
BAH	0.279	0.297	0.039	0.023
EG	0.309	0.330	0.065	0.042
OLMAR	−0.149	0.064	−0.294	0.963
RMR	−0.304	0.036	−0.286	1.305
BNN	0.135	−0.239	0.124	−0.235
CORN	−0.154	−0.118	−0.240	−0.377
CNN EIIE	0.312 ± 0.005	0.462 ± 0.070	0.268 ± 0.206	0.452 ± 0.326
bRNN EIIE	0.309 ± 0.002	0.395 ± 0.004	0.150 ± 0.047	0.399 ± 0.069
RAT	0.308 ± 0.001	0.402 ± 0.004	0.514 ± 0.045	0.856 ± 0.209
HGAM	0.308 ± 0.000	0.578 ± 0.191	0.274 ± 0.204	0.870 ± 0.316
LSRE-CAAN	0.307 ± 0.000	0.292 ± 0.078	0.062 ± 0.022	0.025 ± 0.000
PortRSMs	0.345 ± 0.025	0.555 ± 0.101	0.688 ± 0.068	1.282 ± 0.182

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
