A Selective Portfolio Management Algorithm with Off-Policy Reinforcement Learning Using Dirichlet Distribution
Round 1
Reviewer 1 Report
The authors proposed DDT, an algorithm that calculates multiple optimal portfolios by using a Dirichlet distribution. The positive side of this work lies in the very idea of building a distribution for each asset. However, the authors' model does not provide for smooth adaptation to a change in the type of distribution. If you look at the distribution of the European gas price time series, you can see that the normal distribution abruptly changes to the Fisher distribution. It is desirable for the authors to test their model on crisis assets (for example, the price of European gas). The question of distribution stability rests on an indefinite measure of ergodicity. Regardless of the type of distribution, we return to the question of how the time intervals are determined. For different time intervals, the distribution will be different. The authors define the time interval as the period between buying and selling an asset. However, this is very arbitrary: there is no information on the frequency of the interval (day, month, year). The Dirichlet distribution is a generalization of the Beta distribution to the multivariate case. It can be assumed that the behavior of any asset can be fitted to a Dirichlet distribution. Therefore, the question arises as to the advisability of using the Dirichlet distribution. The authors predict an asset based on this distribution. However, the distribution of an asset in the market is always dynamic. The distribution can have a right or left tail, kurtosis, skewness, and so on. Therefore, the forecast may not be entirely correct. Moreover, the authors make a forecast for 2 years ahead. This is a fairly long horizon for forecasting. If the market is stable and its distribution does not change, then it is possible to forecast for a long period. In a crisis market, however, this model most likely will not work. The authors should state the limitations of the model.
Most likely, the work is focused on markets with a stable distribution and is suitable for economies with government regulation. In general, the work is a promising scientific study in terms of its novelty and level of research. The goals and objectives set in the article have been successfully achieved. There were no significant shortcomings in the article.
Comments for author File: Comments.pdf
Author Response
We would like to express our sincere gratitude to the reviewer for his/her valuable suggestions on how to improve our manuscript. We have modified the manuscript according to the following comments.
The authors proposed DDT, an algorithm that calculates multiple optimal portfolios by using a Dirichlet distribution. The positive side of this work lies in the very idea of building a distribution for each asset.
However, the authors' model does not provide for smooth adaptation to a change in the type of distribution. If you look at the distribution of the European gas price time series, you can see that the normal distribution abruptly changes to the Fisher distribution. It is desirable for the authors to test their model on crisis assets (for example, the price of European gas).
The question of distribution stability rests on an indefinite measure of ergodicity. Regardless of the type of distribution, we return to the question of how the time intervals are determined. For different time intervals, the distribution will be different. The authors define the time interval as the period between buying and selling an asset. However, this is very arbitrary: there is no information on the frequency of the interval (day, month, year). The Dirichlet distribution is a generalization of the Beta distribution to the multivariate case. It can be assumed that the behavior of any asset can be fitted to a Dirichlet distribution. Therefore, the question arises as to the advisability of using the Dirichlet distribution.
Response: We thank the reviewer for the helpful comments. First, we would like to emphasize that the quantity we model as a probability distribution is not the price of a specific asset but the portfolio weights of the assets. We therefore defined the policy as a Dirichlet distribution, which is suitable for portfolio management, and the optimal distribution is determined from the asset information at every time step t. We do not model the price of an asset, whose distribution can vary over time.
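To illustrate the distinction drawn in this response, the following is a minimal sketch of a Dirichlet policy over portfolio weights, not the authors' implementation. The per-asset scores, the softplus mapping, and the function names `concentration_from_scores` and `sample_portfolio` are assumptions made for illustration; only the Python standard library is used, and the Dirichlet draw is obtained by normalising independent Gamma draws.

```python
import math
import random


def softplus(x: float) -> float:
    """Map a real-valued score to a strictly positive number (numerically stable)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def concentration_from_scores(scores: list[float]) -> list[float]:
    """Hypothetical policy head: turn per-asset scores into Dirichlet parameters."""
    return [softplus(s) + 1e-3 for s in scores]


def sample_portfolio(alpha: list[float], rng: random.Random) -> list[float]:
    """Draw one weight vector from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
scores_t = [0.2, 1.5, -0.7]  # made-up asset scores at time step t (assumption)
alpha_t = concentration_from_scores(scores_t)

# Several candidate portfolios drawn from the same policy distribution.
weights = [sample_portfolio(alpha_t, rng) for _ in range(5)]
for w in weights:
    print([round(x, 3) for x in w], "sum =", round(sum(w), 6))
```

In this sketch the concentration parameters change with the inputs at each time step while the family of the policy distribution stays Dirichlet, and every sampled vector is non-negative and sums to one.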
The authors predict an asset based on this distribution. However, the distribution of an asset in the market is always dynamic. The distribution can have a right or left tail, kurtosis, skewness, and so on. Therefore, the forecast may not be entirely correct. Moreover, the authors make a forecast for 2 years ahead. This is a fairly long horizon for forecasting. If the market is stable and its distribution does not change, then it is possible to forecast for a long period. In a crisis market, however, this model most likely will not work. The authors should state the limitations of the model.
Response: We thank the reviewer for the valuable comments. The concern that market dynamics can change within about two years, the length of our experimental dataset, also relates to changes in the price distribution of a specific asset. However, since we defined the policy, not the price, as a distribution, the form of the policy distribution remains a Dirichlet distribution regardless of the market dynamics and the length of the test dataset.
Most likely, the work is focused on markets with a stable distribution and is suitable for economies with government regulation.
In general, the work is a promising scientific study in terms of its novelty and level of research. The goals and objectives set in the article have been successfully achieved. There were no significant shortcomings in the article.
Reviewer 2 Report
In general: English language needs a minor improvement
Line 4: Explain the abbreviation "DDT".
Line 33: change “Deep LR” to “LR”.
Figure 4: AAPL has a “data bug”; please correct it. You may then have to recalculate all empirical studies.
Author Response
We would like to express our sincere gratitude to the reviewer for his/her valuable suggestions on how to improve our manuscript. We have modified the manuscript according to the following comments.
- Line 4: Explain the abbreviation "DDT".
Response: We thank the reviewer for reading our paper carefully. We have corrected the typos in the revised manuscript.
- Line 33: change “Deep LR” to “LR”.
Response: We thank the reviewer for reading our paper carefully. We have corrected the typos in the revised manuscript.
- Figure 4: AAPL has a “data bug”; please correct it. You may then have to recalculate all empirical studies.
Response: We thank the reviewer for checking the details of our dataset. We treated the erroneous data points as outliers, removed them, and ran the experiments on the cleaned data. We have also replotted the AAPL closing price with the outliers removed.
Reviewer 3 Report
I am appreciative of the chance to read this informative work. While reading this study, I noted the following alterations that should be made to enhance its quality.
Regards
Comments for author File: Comments.pdf
Author Response
We would like to express our sincere gratitude to the reviewer for his/her valuable suggestions on how to improve our manuscript. We have modified the manuscript according to the following comments.
- Before submitting the paper's revised version, I found a few typos and incorrect English expositions that may be changed.
Response: We thank the reviewer for reading our paper carefully. We have checked for typos and grammatical errors and improved the English.
- In order to motivate the work, there needs to be some connection to recent research; from what I can tell, fresh studies on the subject were published in 2018 to 2022.
Response: We thank the reviewer for recommending helpful references. We added them to the revised manuscript.
- Do authors try to use the Markov chain Monte Carlo method to interpret the experiments in their research?
Response: We employed Monte Carlo sampling in our experiments. Because this sampling process is random, we trained and tested on the same data 10 times.
- Could authors try using the Weibull, Pareto, Gompertz, etc. distributions instead of the Dirichlet distribution?
Response: Unfortunately, it is hard to employ other types of distributions. In this paper, our goal is to learn a distribution over the space of portfolio vectors, which are subject to the sum-to-one constraint. Random vectors sampled from other distributions, such as the Weibull distribution, do not generally satisfy this constraint, so those distributions are hard to apply. A random vector sampled from a Dirichlet distribution, by contrast, always sums to one.
- Because the paper's conclusion and abstract are so similar, it has to be revised.
Response: We thank the reviewer for the valuable comments. We revised our conclusion as follows (the revised sentences are marked in bold font):
We proposed the Dirichlet Distribution Trader (DDT), a scalable DRL model for selectively managing portfolios according to risk. Its policy is a Dirichlet distribution, which allows the agent to generate multiple portfolio samples.
Therefore, our algorithm can selectively manage the portfolio according to the level of risk after selecting 10 portfolios with low transaction cost. In the Risk-Aware Portfolio Management Experiment, we showed that the cumulative returns of the portfolios corresponding to each of the three risk levels had distinct characteristics depending on the trend of the dataset, which demonstrates the need for selective portfolio management.
In addition, since the value π(a|s) can be obtained from the distribution, efficient training is possible through off-policy learning with importance sampling, and we showed in the On-Policy, Off-Policy Experiment that it performs better than on-policy learning.
Our model is not tied to a fixed number of portfolio stocks and can scale to adjust the weights of new stocks added to the portfolio even if it is trained on only three stocks in the base dataset. In the scalability experiment, DDT-A, which is trained on only three stocks, showed almost the same performance as DDT-B, which is trained on all stocks in the portfolio.
Based on these advantages, comparative experiments show that DDT is superior to other algorithms in risk metrics and return metrics.
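The off-policy remark in the conclusion above relies on being able to evaluate π(a|s) in closed form. The following is a minimal sketch of how an importance-sampling ratio could be computed for a Dirichlet policy; it is not the authors' training code. The function names, the example action, and the two sets of concentration parameters are hypothetical, and the density is the standard Dirichlet density evaluated in log space.

```python
import math


def dirichlet_log_pdf(weights: list[float], alpha: list[float]) -> float:
    """Log-density of Dirichlet(alpha) at a weight vector on the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(w) for a, w in zip(alpha, weights))


def importance_ratio(action: list[float],
                     alpha_target: list[float],
                     alpha_behaviour: list[float]) -> float:
    """rho = pi(a|s) / mu(a|s), computed in log space for numerical stability."""
    log_rho = (dirichlet_log_pdf(action, alpha_target)
               - dirichlet_log_pdf(action, alpha_behaviour))
    return math.exp(log_rho)


# Hypothetical example: a stored portfolio and two policy parameterisations.
action = [0.5, 0.3, 0.2]        # portfolio weights sampled by the old policy
alpha_old = [2.0, 2.0, 2.0]     # behaviour policy parameters (assumption)
alpha_new = [3.0, 2.0, 1.5]     # current policy parameters (assumption)

rho = importance_ratio(action, alpha_new, alpha_old)
print("importance ratio:", rho)
```

A ratio of this kind would typically multiply the return or advantage of a stored sample so that data collected under an older policy can be reused; practical implementations often clip it to limit variance.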