# Observation Time Effects in Reinforcement Learning on Contracts for Difference

## Abstract

## 1. Introduction

## 2. Literature Review

## 3. Methodology

#### 3.1. Data Setup

#### 3.2. The Agent

#### 3.3. Evaluation Setup

#### 3.4. Q-Learning Model

^{−5}, a value chosen after testing several alternatives and comparing the resulting losses, as seen in Figure 5. All other graphs for the 45 s time frame can be found in Appendix C.

## 4. Results

## 5. Conclusions and Outlook

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Sample Availability

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| CfD | Contracts for Difference |
| CNN | Convolutional Neural Network |
| LSTM | Long Short-Term Memory |
| VAE | Variational Auto-Encoder |
| MDN | Mixture Density Network |

## Appendix A. Graphs

## Appendix B. Pseudocode

**Algorithm A1.** Training in Market Simulation

**procedure** TRAINING

Historic trade data $X$

Prioritized experience replay memory $M$

Input sequence length $L$, adapted to the observation length

Policy $\pi(s \mid \Theta)$

**while** learning step < number of steps **do** ▷ iterate over random points in time

$t \leftarrow$ random point in time in the market history

**while** $t \le \mathrm{len}(X) - L$ **do** ▷ choose an action based on the point in time and the observation length

$state_1 \leftarrow X[t:t+L] - \mathrm{mean}(X[t:t+L])$

$action \leftarrow \pi(state_1)$

$(state_2, reward, t, terminal) \leftarrow \mathrm{marketLogic}(action)$

$\mathrm{appendToMemory}(M, (state_1, state_2, action, reward, terminal))$

**if** $\lvert M \rvert \ge batchsize$ **then** ▷ increment the learning step and update the model

$batch \leftarrow \mathrm{sample}(M)$

$losses \leftarrow \text{Q-Learning}(batch)$

$\mathrm{updatePriority}(M, batch)$

learning step ← learning step + 1

**end if**

**end while**

**end while**

**end procedure**
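The control flow of Algorithm A1 can be sketched in plain Python. This is only an illustration under loud assumptions, not the authors' implementation: the policy network is replaced by a small tabular Q-function, `market_logic` is a hypothetical stand-in that rewards the signed one-tick price move, and `PrioritizedMemory` is a simplified list-based proportional replay. All function names, the action set, and the hyperparameters are invented for the example.

```python
import random

ACTIONS = ("wait", "buy", "sell")  # hypothetical action set, not from the paper


def normalize(window):
    """Subtract the window mean, as in the state_1 step of Algorithm A1."""
    mean = sum(window) / len(window)
    return [p - mean for p in window]


def features(state):
    # Coarse, hypothetical state key: is the latest price above the window mean?
    return state[-1] > 0


class PrioritizedMemory:
    """Simplified proportional prioritized replay (O(n) sampling)."""

    def __init__(self, capacity):
        self.capacity, self.items, self.priorities = capacity, [], []

    def __len__(self):
        return len(self.items)

    def append(self, transition):
        # New transitions start at the current maximum priority.
        priority = max(self.priorities, default=1.0)
        if len(self.items) >= self.capacity:
            self.items.pop(0)
            self.priorities.pop(0)
        self.items.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size, rng):
        idx = rng.choices(range(len(self.items)), weights=self.priorities, k=batch_size)
        return idx, [self.items[i] for i in idx]

    def update(self, indices, losses):
        for i, loss in zip(indices, losses):
            self.priorities[i] = abs(loss) + 1e-6


def market_logic(prices, t, length, action):
    """Hypothetical stand-in for marketLogic: advance one tick, reward the signed move."""
    t_next = t + 1
    if t_next + length > len(prices):  # no full window left, so the episode ends
        return None, 0.0, t_next, True
    move = prices[t_next + length - 1] - prices[t + length - 1]
    reward = {"buy": move, "sell": -move, "wait": 0.0}[action]
    return normalize(prices[t_next:t_next + length]), reward, t_next, False


def policy(q, state, rng, epsilon=0.1):
    """Epsilon-greedy choice over the tabular Q-values (stand-in for the network policy)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((features(state), a), 0.0))


def q_learning(q, batch, alpha=0.1, gamma=0.99):
    """Tabular Q-learning update; returns absolute TD errors as per-sample losses."""
    losses = []
    for s1, s2, action, reward, terminal in batch:
        key = (features(s1), action)
        future = 0.0 if terminal else max(q.get((features(s2), a), 0.0) for a in ACTIONS)
        td_error = reward + gamma * future - q.get(key, 0.0)
        q[key] = q.get(key, 0.0) + alpha * td_error
        losses.append(abs(td_error))
    return losses


def train(prices, length, num_steps, batch_size=8, seed=0):
    rng = random.Random(seed)
    memory, q, step = PrioritizedMemory(1000), {}, 0
    while step < num_steps:
        t = rng.randrange(len(prices) - length + 1)  # random point in market history
        # The extra `step < num_steps` check is an addition so the sketch stops exactly on budget.
        while t <= len(prices) - length and step < num_steps:
            state_1 = normalize(prices[t:t + length])
            action = policy(q, state_1, rng)
            state_2, reward, t, terminal = market_logic(prices, t, length, action)
            memory.append((state_1, state_2, action, reward, terminal))
            if len(memory) >= batch_size:
                indices, batch = memory.sample(batch_size, rng)
                losses = q_learning(q, batch)
                memory.update(indices, losses)
                step += 1
    return q, step
```

Calling `train(prices, length=10, num_steps=50)` on any price list of at least 10 values runs 50 learning steps and returns the learned Q-table. Feeding the absolute TD errors back as priorities is one common choice for prioritized replay; whether the paper uses exactly this scheme is not stated in this section.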

## Appendix C. Randomly Chosen Loss Graphs

## Appendix D. Manually Chosen Margin Graphs

**Figure 4.** The neural network architecture used in our evaluation. For n observed assets over a time span t, the input layer consists of t × n × 4 input neurons. For the following convolutional layer, we choose the number of filters f such that the resulting activation shape fits the later convolutional layers. After the convolutional part, we employ an LSTM layer consisting of 100 recurrent neurons with rectified linear activation.
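As a small worked example of the input size stated in the caption of Figure 4, the following sketch computes t × n × 4 for a given configuration. The function and parameter names are hypothetical, `features_per_asset` is merely a label for the factor 4 in the caption, and the example values for t and n are made up for illustration.

```python
def input_neurons(time_steps: int, num_assets: int, features_per_asset: int = 4) -> int:
    """Number of input neurons for an input of shape (t, n, 4), per the Figure 4 caption."""
    if time_steps <= 0 or num_assets <= 0 or features_per_asset <= 0:
        raise ValueError("all dimensions must be positive")
    return time_steps * num_assets * features_per_asset


# Hypothetical configuration: 60 time steps observed for a single asset.
example_shape = (60, 1, 4)
example_size = input_neurons(60, 1)  # 60 * 1 * 4 = 240
```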

**Figure 6.** The number of positive outcomes over all tests for each observation period. The highest number of wins occurs at the 12 min mark, closely followed by 10 s, while by far the lowest occurs at the 8 min mark.

**Figure 7.** The total number of positive outcomes per asset, summed across all time spans. Trading on the OIL market proved difficult, whereas the agent was applied successfully to the EURUSD and US500 markets.

**Table 1.** A closer look at the details. Each cell contains the total number of positive outcomes (maximum 25) for each combination of asset and observation time frame. Highlighted in bold are assets that achieved a substantial number of wins in their observation time.

Total Number of Wins Per Asset

| Time | US500 | OIL | GOLD | EURUSD |
|---|---|---|---|---|
| 10 s | 10 | 1 | 5 | 19 |
| 30 s | 7 | 5 | 7 | 8 |
| 45 s | 13 | 0 | 6 | 8 |
| 1 min | 15 | 3 | 11 | 4 |
| 5 min | 9 | 1 | 6 | 13 |
| 8 min | 6 | 1 | 7 | 5 |
| 10 min | 11 | 1 | 12 | 7 |
| 12 min | 9 | 3 | 7 | 17 |
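The per-period and per-asset totals described in the captions of Figures 6 and 7 can be recomputed from Table 1. The sketch below copies the win counts from the table and sums them by row and by column. The variable and function names are illustrative only.

```python
# Win counts copied from Table 1 (rows: observation time, columns: asset).
ASSETS = ("US500", "OIL", "GOLD", "EURUSD")
WINS = {
    "10 s": (10, 1, 5, 19),
    "30 s": (7, 5, 7, 8),
    "45 s": (13, 0, 6, 8),
    "1 min": (15, 3, 11, 4),
    "5 min": (9, 1, 6, 13),
    "8 min": (6, 1, 7, 5),
    "10 min": (11, 1, 12, 7),
    "12 min": (9, 3, 7, 17),
}


def wins_per_time(wins):
    """Sum each row: total wins per observation period (cf. Figure 6)."""
    return {time: sum(row) for time, row in wins.items()}


def wins_per_asset(wins, assets):
    """Sum each column: total wins per asset (cf. Figure 7)."""
    return {asset: sum(row[i] for row in wins.values()) for i, asset in enumerate(assets)}


per_time = wins_per_time(WINS)
per_asset = wins_per_asset(WINS, ASSETS)
best_time = max(per_time, key=per_time.get)    # "12 min" with 36 wins
worst_time = min(per_time, key=per_time.get)   # "8 min" with 19 wins
```

The row sums are 36 wins for 12 min and 35 for 10 s, with a minimum of 19 at 8 min, matching the Figure 6 caption. The column sums are 80 for US500, 15 for OIL, 61 for GOLD, and 81 for EURUSD.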

Total Number of Wins Per Asset

| Agent | US500 | OIL | GOLD | EURUSD |
|---|---|---|---|---|
| Random | 7 | 0 | 7 | 10 |
| Buy-Hold | 0 | 0 | 1 | 23 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

Wehrmann, Maximilian, Nico Zengeler, and Uwe Handmann. 2021. "Observation Time Effects in Reinforcement Learning on Contracts for Difference." *Journal of Risk and Financial Management* 14, no. 2: 54. https://doi.org/10.3390/jrfm14020054