Modeling and Forecasting Cryptocurrency Closing Prices with Rao Algorithm-Based Artificial Neural Networks: A Machine Learning Approach

Artificial neural networks (ANNs) are suitable procedures for predicting financial time series (FTS). Cryptocurrencies are attractive investment assets; therefore, the effective prediction of cryptocurrencies has become a trending area of research. Capturing the inherent uncertainties associated with cryptocurrency FTS using conventional methods is difficult. Though ANNs are the better alternative, fixing the optimal parameters of ANNs is a tedious job. This article develops a hybrid ANN through the Rao algorithm (RA + ANN) for the effective prediction of six popular cryptocurrencies: Bitcoin, Litecoin, Ethereum, CMC 200, Tether, and Ripple. Six comparative models, i.e., GA + ANN, PSO + ANN, MLP, SVM, LSE, and ARIMA, are developed and trained in a similar way. All these models are evaluated through the mean absolute percentage of error (MAPE) and average relative variance (ARV) metrics. The proposed RA + ANN generated the lowest MAPE and ARV values, which are statistically different from those of the existing methods mentioned above; hence it can be recommended as a potential financial instrument for predicting cryptocurrencies.


Introduction
Cryptocurrency is a virtual or digital currency that is meant to be a means of exchange for online transactions to purchase goods and services. An online ledger secured by strong cryptography is used for online transactions. Cryptos evince a lot of interest today because they are traded for profit and because of the rich and growing valuations achieved in relatively short periods of time. While these currencies are unregulated, many international governments, after an early period of denial, are now showing signs of adopting some form of cryptocurrency for transactions. Without any bank or central authority, these cryptos are built on a decentralized, distributed peer-to-peer network. Cryptocurrencies have certain special characteristics [1]. They use decentralized control, so no central authority is required. Pseudo-anonymity is ensured to avoid disputes, and since digital currency is easy to reproduce, protection against double-spending attacks is provided.
While Bitcoin was the earliest and most popular one [2], there are also many others that continue to grow in user base, popularity, and market share, such as Ethereum, Tether, CMC 200, Ripple, Polkadot, XRP, Cardano, Chainlink, Litecoin, Bitcoin Cash, Binance Coin, etc. As cryptocurrency market prices grow fast and behave similarly to stock market price movements, they attract investors as well as researchers. Several parameter-light optimization techniques have been proposed for such problems, such as teaching learning-based optimization [36] and the Jaya algorithm [37]. With the objective of developing simple and effective optimization techniques, Rao proposed metaphor-less algorithms called Rao algorithms [25], where algorithm-specific learning parameters are not required, i.e., these algorithms are free from learning parameters; only the initialization of the population size is needed. The algorithms repeatedly find the best and the worst solutions in the population and initiate random interaction among the candidate solutions. These algorithms have since been applied to the design optimization of mechanical system components [14,38,39], the parameter estimation of photovoltaic cells [40,41], the optimal coordination of overcurrent relays [42], etc. However, Rao algorithms have not been explored for the optimization of ANN parameters. This article tests the appropriateness of Rao algorithms for finding the optimal parameters of ANNs and applies the resultant method to stock closing price prediction.
The objectives of this article are highlighted as follows:

1. Designing an efficient ANN forecast for the nearly precise prediction of cryptocurrencies such as Bitcoin, Litecoin, Ethereum, CMC 200, Tether, and Ripple.

2. Suitably tuning the ANN parameters (i.e., weights and biases) by RA, thus forming a hybrid model (i.e., RA + ANN) that overcomes the limitations of derivative-based optimization techniques.

3. Evaluating the performance of the RA + ANN forecast through two performance metrics, MAPE and ARV.
The flow of the article is as follows. A concise description of ANNs is presented in Section 2. The proposed RA + ANN is described in Section 3. The experimental data are presented in Section 4. Section 5 contains the summarized results from the experimentation and analysis of outcomes, followed by concluding remarks.

Artificial Neural Network
An ANN simulates the way the human brain works, making it different from conventional digital computers [43]. ANNs are capable of complex data processing, with neurons as their computational units. They can obtain good approximate solutions to intractable problems, and they can solve a very complex problem by decomposing it into smaller tasks, which distinguishes them from conventional computing.
Neural networks are trained by detecting patterns in the data. As the human brain consists of billions of neurons fully interconnected by synapses, ANNs similarly consist of hundreds of processing units (artificial neurons) that are fully connected through neural links. The schematic block diagram (Figure 1) represents the basic ANN architecture with some hidden layers of neurons and one output layer neuron. Supervised learning is used for error correction in this case, i.e., the expected response for the given inputs is submitted at the output neuron to calculate the error. To speculate a one-day-ahead index from the time-series data, a single output unit model is used. A linear transfer function is used in the input layer. The hidden and output layer neurons contain nonlinear activation functions. Here, we have taken the nonlinear activation as the sigmoidal function in (1), i.e.,

y_out = 1 / (1 + e^(−λ y_in)),    (1)

where y_out and y_in represent the output and input of the neuron y, respectively, and λ acts as the sigmoidal gain. Let there be n neurons in the input layer, which stands for the input variables of the problem; the input layer contains one node per input variable. The hidden layers are helpful in capturing the nonlinearity among the variables.
In the hidden layer, the output y_j of each neuron j is computed using (2):

y_j = f( Σ_{i=1..n} w_ij x_i + b_j ),    (2)

where f stands for a nonlinear activation function, b_j is the bias value, x_i represents the i-th component of the input vector, and w_ij is the synaptic weight connecting the i-th input neuron and the j-th hidden neuron. Suppose there are m neurons present in this hidden layer; then these m outputs become the input for the next hidden layer, and (3) for each neuron j of the next hidden layer can be represented as

y_j = f( Σ_{i=1..m} w_ij y_i + b_j ).    (3)

This signal flows in the forward direction through each hidden layer until it reaches the node of the output layer. For a single output neuron, (4) is used to calculate the output y_esst:

y_esst = f( Σ_{j=1..m} v_j y_j + b_o ),    (4)

where b_o is the bias for the output node, y_j is the weighted sum calculated as in (2), and v_j represents the synaptic weight between the j-th hidden neuron and the output neuron. Given a set of training samples S = {x_i, y_i}, i = 1, ..., N, to train the ANN, let y_i be the expected output of the i-th input sample and y_esst the computed output for the same input; then using (5) the error is calculated as

Error_i = y_i − y_esst.    (5)

The error value produced by the n-th training sample at the output of neuron i is defined by (6) as

e_i(n) = y_i(n) − y_esst,i(n).    (6)

Then the instantaneous error at neuron i is defined by (7) as

ε_i(n) = (1/2) e_i^2(n).    (7)

Hence the total instantaneous error of the whole network, ε(n), can be defined by (8) as

ε(n) = (1/2) Σ_{i∈C} e_i^2(n),    (8)

where C represents the set containing all the output layer neurons. In this paper, we have considered one neuron in the output layer; therefore, the network error of the neural network model can be represented as in (9):

ε(n) = (1/2) e^2(n).    (9)

The average error over the whole training sample can be defined by (10) as

ε_avg = (1/N) Σ_{n=1..N} ε(n).    (10)

The optimal weight vector is computed by the weight update algorithm using the error signal Error_i. In the training phase, the training set input vectors are repeatedly presented to the neural network model, and the weights and biases are updated by the training algorithm. If the error per epoch is small enough, then no further training is required; unfortunately, this criterion may lead to a local minimum. Therefore, the generalization performance of the network is tested after each iteration of training, and once it is adequate, the learning process is stopped. The intent, therefore, is to reduce the total error of Equation (5) with an optimized set of weight and bias vectors of the ANN.
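As a concrete illustration, the forward pass through one hidden layer and the single output neuron, together with the instantaneous network error, can be sketched in Python. This is a minimal sketch, not the article's implementation; the function and variable names (`forward`, `network_error`, etc.) are ours.

```python
import numpy as np

def sigmoid(y_in, lam=1.0):
    """Sigmoidal activation of Equation (1); lam is the sigmoidal gain."""
    return 1.0 / (1.0 + np.exp(-lam * y_in))

def forward(x, W, b, v, b_o):
    """Forward pass of a single-hidden-layer, single-output ANN.
    x: input vector (n,), W: input-to-hidden weights (n, m),
    b: hidden biases (m,), v: hidden-to-output weights (m,), b_o: output bias."""
    y_hidden = sigmoid(x @ W + b)          # hidden outputs, per Equation (2)
    y_esst = sigmoid(y_hidden @ v + b_o)   # single output neuron, per Equation (4)
    return y_esst

def network_error(y_actual, y_esst):
    """Instantaneous squared error of the single output neuron, Equation (9)."""
    return 0.5 * (y_actual - y_esst) ** 2
```

With all weights and biases at zero, every sigmoid receives 0 and outputs 0.5, so the network output is 0.5 regardless of the input — a quick sanity check of the wiring.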

Proposed RA + ANN Based Forecasting
Rao algorithms are newly developed population-based optimization algorithms. Unlike other nature-inspired optimization techniques, they do not mimic the behavior of swarms, animals, or birds, or any physical or chemical phenomenon; therefore, they are claimed as metaphor-less optimization techniques by the inventor [25]. These algorithms require no algorithm-specific parameters; specifying only the number of candidate solutions, the design variables, and the termination criteria is sufficient to automate the search process. The techniques are simple, do not necessitate human intervention, and provide effective solutions to complicated problems. The algorithms are presented as follows.
Let f(w) be an objective function that needs to be optimized (i.e., the error function to be minimized here). Let the population (the search space of the ANN model) have n candidate solutions (weight and bias vectors), each of which has m design variables (weight values). Each candidate solution has a fitness value (i.e., an error value), and the lower the error signal, the better the solution. At any iteration i, let the best and worst solutions of the population be f(w)_best and f(w)_worst, respectively. The current value of a candidate solution (weight vector) at the i-th iteration is updated as per Equations (11)-(13):

W'_{j,k,i} = W_{j,k,i} + rand_{1,j,i} (W_{j,best,i} − W_{j,worst,i}),    (11)

W'_{j,k,i} = W_{j,k,i} + rand_{1,j,i} (W_{j,best,i} − W_{j,worst,i}) + rand_{2,j,i} (|W_{j,k,i} or W_{j,l,i}| − |W_{j,l,i} or W_{j,k,i}|),    (12)

W'_{j,k,i} = W_{j,k,i} + rand_{1,j,i} (W_{j,best,i} − |W_{j,worst,i}|) + rand_{2,j,i} (|W_{j,k,i} or W_{j,l,i}| − (W_{j,l,i} or W_{j,k,i})),    (13)

where:
W_{j,k,i} = the value of the j-th variable of the k-th solution (k = 1, 2, 3, ..., n) at the i-th iteration;
W'_{j,k,i} = the modified value of the j-th variable of the k-th solution at the i-th iteration;
W_{j,best,i} = the value of the j-th variable of the best solution in the i-th iteration;
W_{j,worst,i} = the value of the j-th variable of the worst solution in the i-th iteration;
rand_{1,j,i} and rand_{2,j,i} are two random values in [0,1].
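The simplest of the three rules, Equation (11), moves every candidate using only the current best and worst solutions, with a fresh random value per variable. A short illustrative sketch (the name `rao1_update` is ours, not from the article):

```python
import numpy as np

def rao1_update(population, fitness, rng):
    """One Rao-1 modification step, Equation (11).
    population: (n_candidates, n_vars) weight vectors; fitness: (n_candidates,)
    error values, where lower is better; rng: a numpy random Generator."""
    best = population[np.argmin(fitness)]    # lowest error = best solution
    worst = population[np.argmax(fitness)]
    r1 = rng.random(population.shape)        # rand_{1,j,i}, drawn per variable
    return population + r1 * (best - worst)
```

Note that when the best and worst solutions coincide, the step vanishes and the population is left unchanged, which is consistent with Equation (11).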
The terms "W_{j,k,i} or W_{j,l,i}" in (11)-(13) represent the fitness comparison of the k-th candidate solution with that of the l-th solution, which is randomly drawn from the population. An information exchange occurs based on their fitness values, which ensures communication among the candidate solutions. Based on this concept, the RA + ANN-based forecasting can be explained by Algorithm 1; a pictorial description is given in Figure 2. The RA + ANN forecasting model works as follows: the cryptocurrency's historical closing prices are collected. Approximately 60% of the total input data are used for training the neural net, and after the network is trained, the remaining 40% of the data is used for testing the correctness of the proposed model. The closing price data are collected and stored as time-series data. We have used the sliding window approach to generate the training and testing data sets. The input for the network is normalized in the range [0,1]. The normalized input is fed into the network to obtain an output.
The difference between the computed output (estimated) and the expected output (actual) gives an error. The Rao algorithm is used to update the weights to obtain the minimum error. The generalization performance of the network is tested after each iteration of training. Here, Rao algorithms work by identifying the best and worst solutions and through interaction among the candidates in the search space, i.e., the population. The lack of algorithm-specific parameters eliminates any human intervention. In Algorithm 1, the size of the population, the design variables, and the stopping criteria need to be fixed at the beginning. The size of the sliding window used to form the training and test patterns is a matter of experimentation.
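The training workflow above can be approximated by the loop below. This is a hedged sketch, not the article's Matlab implementation: we assume a greedy acceptance step (a trial vector replaces a candidate only when it lowers the error), and a simple sphere function merely stands in for the ANN forecasting error; `rao1_minimize` and `error_fn` are our names.

```python
import numpy as np

def rao1_minimize(error_fn, n_candidates, n_vars, n_iters, seed=0):
    """Metaphor-less Rao-1 search loop: only the population size, variable
    count, and iteration budget are specified, with no tunable parameters.
    error_fn maps a candidate weight vector to a scalar error."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1.0, 1.0, (n_candidates, n_vars))   # initial population
    fit = np.array([error_fn(w) for w in pop])
    for _ in range(n_iters):
        best = pop[np.argmin(fit)]                          # lowest error
        worst = pop[np.argmax(fit)]
        trial = pop + rng.random(pop.shape) * (best - worst)  # Equation (11)
        trial_fit = np.array([error_fn(w) for w in trial])
        improved = trial_fit < fit                          # greedy acceptance
        pop[improved] = trial[improved]
        fit[improved] = trial_fit[improved]
    return pop[np.argmin(fit)], float(fit.min())

# Toy check: a 5-variable sphere function stands in for the ANN error.
w_best, e_best = rao1_minimize(lambda w: float(np.sum(w ** 2)), 50, 5, 100)
```

In the actual model, `error_fn` would run the forward pass of the ANN over the training windows and return the training error for the candidate weight-and-bias vector.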

Cryptocurrency Data
The models are evaluated by experimenting with the cryptocurrencies' historical closing prices collected from https://www.CryptoDataDownload.com (accessed on 26 March 2021). The currencies are Bitcoin, Litecoin, Ethereum, CMC 200, Tether, and Ripple. The data cover the period from 1 January 2019 to 25 March 2021. Condensed statistics of the six datasets are gathered in Table 1. The price series are plotted in Figure 3. The raw data are normalized using the sigmoid data normalization method [44].

Experimental Results and Analysis
To evaluate the models, we used them to forecast the prices of the six cryptocurrency datasets separately. To assess the capacity of the RA + ANN method, six other methods, i.e., ANN trained by genetic algorithm (GA + ANN), particle swarm optimization trained ANN (PSO + ANN), MLP, SVM, ARIMA, and least squared estimator (LSE), are developed in this study and a comparison is performed.

Model Input Selection and Normalization
A time-series approach is used in this study. A rolling window method is used to generate the train and test patterns from the dataset. The method is depicted in Figure 4. Each sample in the dataset is considered as a data point on the time series. A window of fixed size is rolled over the time series. On each movement, an old datapoint is dropped and a new datapoint is included, as shown in Figure 4. The datapoints included by the window at any instant of time constitute one training sample for the model. The width of the window is determined experimentally. For example, one train/test pattern formed by a rolling window of width three consists of the inputs x(t), x(t+1), x(t+2), with the next point as the one-day-ahead target. Each input pattern is then normalized to scale the data into the same range for each input feature to diminish bias [44,45]. The tanh normalization method as in Equation (14) is used to standardize the input data, where the mean and standard deviation of a training pattern are represented as µ and σ, respectively:

x' = (1/2) [tanh(0.01 (x − µ) / σ) + 1].    (14)
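The two preprocessing steps can be sketched as follows. The widely used form of tanh normalization, 0.5·[tanh(0.01·(x − µ)/σ) + 1], is assumed here, and the function names are ours:

```python
import numpy as np

def sliding_windows(series, width):
    """Roll a window of fixed width over the series: each window is one
    input pattern and the next point is its one-day-ahead target."""
    X = np.array([series[t:t + width] for t in range(len(series) - width)])
    y = np.array(series[width:])
    return X, y

def tanh_normalize(pattern):
    """tanh normalization: scales a pattern into (0, 1) using the
    pattern's own mean mu and standard deviation sigma."""
    pattern = np.asarray(pattern, dtype=float)
    mu, sigma = pattern.mean(), pattern.std()
    return 0.5 * (np.tanh(0.01 * (pattern - mu) / sigma) + 1.0)
```

For a series 1, 2, ..., 6 and width three, the windows are (1,2,3), (2,3,4), (3,4,5) with targets 4, 5, 6, matching the drop-one/add-one movement described above.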


Performance Evaluation Metrics
The accuracy of all models is assessed through two measures, i.e., Mean Absolute Percentage of Errors (MAPE) and Average Relative Variance (ARV), as in (15) and (16):

MAPE = (1/N) Σ_{i=1..N} | (y_i − ŷ_i) / y_i |,    (15)

ARV = Σ_{i=1..N} (ŷ_i − y_i)^2 / Σ_{i=1..N} (y_i − ȳ)^2,    (16)

where y_i is the actual value, ŷ_i is the predicted value, and ȳ is the mean of the actual series.
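Under the standard definitions of these two metrics (assumed here; the function names are ours), they can be computed as:

```python
import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage of Error, expressed as a fraction
    (multiply by 100 for a percentage)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)))

def arv(actual, predicted):
    """Average Relative Variance: squared error of the model relative to
    that of the naive mean predictor; values below 1 beat the mean."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sum((predicted - actual) ** 2)
                 / np.sum((actual - actual.mean()) ** 2))
```

A model that simply predicts the series mean scores an ARV of exactly 1, so an ARV well below 1 indicates genuine predictive value.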

Experimental Setting
The train and test patterns are generated as described in Section 5.1, and the same patterns are fed to all seven models separately. To overcome the biases of the models, each model is simulated twenty times, and the mean error value is recorded for comparison. An ANN with one hidden layer is used as the base network. The input layer size is equal to the rolling window size, and the output layer has only one neuron, as there is one target output. However, the number of neurons in the hidden layer is decided experimentally. An inadequate number of hidden neurons may produce poor accuracy, whereas an excess number of such neurons adds computational overhead. Therefore, the size of the hidden layer strongly impacts the model performance and must be decided carefully. During experimentation, different possible values for the model parameters were tested, and the best values were recorded. The suitable parameter values obtained during the simulation process are called simulated parameters. For the MLP model, the learning rate (α) and momentum factor (µ) are set to 0.2 and 0.5, respectively. The number of iterations was fixed at 400, and gradient descent-based backpropagation learning was adopted for training. For SVM, we used the radial basis function as the kernel. For PSO + ANN, the total number of particles was set to 100, the number of iterations was 250, and the learning factors were fixed at 3 each. Similarly, in the case of GA + ANN, the three critical parameters, i.e., the population size, crossover probability, and mutation probability, are considered as 80, 0.7, and 0.02, respectively. The stopping condition is the maximum number of generations, i.e., 200. Binary encoding is used for individual encoding, and elitism is used as the selection method. However, the RA + ANN used only 50 candidate solutions and 100 iterations to reach the global optimum.
All the experimentation is performed on a system with an Intel Core i3 CPU (2.27 GHz) and 8.0 GB memory, in a Matlab 2015 programming environment.

Simulation Results and Discussion
The mean values from twenty simulations of each forecast are recorded and considered as the model performance. The MAPE and ARV values produced by the seven models on the six FTS are summarized in Table 2. It may be seen that the RA + ANN has lower MAPE values than the others. For example, the MAPE of RA + ANN is 0.0300 for Bitcoin, 0.0322 for Litecoin, 0.0397 for Ripple, and 0.0385 for Ethereum. It also achieved the lowest ARV values for all six FTS datasets. The overall performance of RA + ANN is better than that of the others. Furthermore, to show the goodness of the proposed RA + ANN, the estimated prices are plotted against the expected prices in Figures 5-10, respectively. It is apparent that the predicted prices are very close to the actual prices and follow the trend accurately. For the sake of clear visibility, we plotted the estimated prices against the actual prices for one financial year of data (approximately 252 financial days). We also conducted two well-known statistical significance tests, the Wilcoxon signed-rank test and the Diebold-Mariano test, to establish the difference between the RA + ANN forecast and the comparative methods statistically, considering the MAPEs from all the models. The test outcomes are summarized in Table 3. Here, h = 1 indicates rejection of the null hypothesis that the proposed and comparative forecasts are not statistically different. In the case of the Diebold-Mariano test, if the statistic falls beyond the range of −1.976 to +1.976, the null hypothesis is rejected and h = 1. These results support the rejection of the null hypothesis and show that there is a significant difference between RA + ANN and the other methods. From the simulation studies, comparative analysis, and significance test results, the following major points are drawn.

• The RA + ANN-based forecast was found quite capable of capturing the inherent dynamism and uncertainties associated with cryptocurrency data.
• The hybridization of RA and ANN achieved improved forecasting accuracy compared to the other models.
• The outcomes of the statistical tests justified the significant difference between RA + ANN and the others.
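The Diebold-Mariano comparison with the ±1.976 critical range mentioned above can be sketched as follows. This is an illustrative sketch using a simple one-step, squared-loss form of the statistic and hypothetical (randomly generated) error series rather than the article's data:

```python
import numpy as np

def diebold_mariano(err_a, err_b):
    """Simple Diebold-Mariano statistic for equal predictive accuracy
    (one-step-ahead, squared-loss differential; a minimal sketch)."""
    d = np.asarray(err_a, dtype=float) ** 2 - np.asarray(err_b, dtype=float) ** 2
    return float(d.mean() / np.sqrt(d.var(ddof=1) / len(d)))

# Hypothetical per-day forecast errors for two competing models over
# roughly one financial year; err_ra is the lower-error forecast.
rng = np.random.default_rng(42)
err_ra = rng.normal(0.0, 0.03, 250)
err_ga = rng.normal(0.0, 0.05, 250)

dm = diebold_mariano(err_ra, err_ga)
h = 1 if abs(dm) > 1.976 else 0   # h = 1 rejects "no difference in accuracy"
```

A statistic outside the −1.976 to +1.976 range rejects the null hypothesis of equal accuracy, matching the decision rule stated in the results discussion.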

Conclusions
A hybrid ANN based on Rao algorithm optimization (RA + ANN) is developed in this article, using a single-hidden-layer ANN as the base neural architecture and RA as the search technique for finding the optimal ANN parameters. The Rao algorithms are recently proposed metaphor-less optimization techniques that are simple and need no algorithm-specific parameters to be adjusted. This article explored the suitability of RA for training the ANN to forecast cryptocurrency data such as Bitcoin, Litecoin, Ethereum, CMC 200, Tether, and Ripple. The predictability of RA + ANN is compared with that of GA + ANN, PSO + ANN, MLP, SVM, ARIMA, and LSE in terms of MAPE and ARV. The proposed RA + ANN model has a faster convergence rate and better generalization ability than gradient descent learning. Furthermore, to ensure the effectiveness of RA + ANN, the Wilcoxon signed-rank test and the Diebold-Mariano test are conducted. From the comparative analysis of the experimental results and the outcomes of the statistical significance tests, it is concluded that RA + ANN-based forecasting follows the dynamic trend of cryptocurrencies well in comparison with the other forecasts under consideration. In the future, more ANN structures and sophisticated learning approaches may be investigated. Other factors, such as economic and technical determinants, may be used along with the closing prices to achieve better forecasting accuracy. The proposed RA + ANN framework can also be applied to the efficient prediction of other FTS.