Evaluation of Tree-Based Ensemble Machine Learning Models in Predicting Stock Price Direction of Movement

Forecasting the direction and trend of stock prices is an important task that helps investors make prudent financial decisions in the stock market. Investment in the stock market carries considerable risk, and minimizing prediction error reduces that risk. Machine learning (ML) models typically perform better than statistical and econometric models. In addition, ensemble ML models have been shown in the literature to produce performance superior to that of single ML models. In this work, we compare the effectiveness of tree-based ensemble ML models (Random Forest (RF), XGBoost Classifier (XG), Bagging Classifier (BC), AdaBoost Classifier (Ada), Extra Trees Classifier (ET), and Voting Classifier (VC)) in forecasting the direction of stock price movement. Eight stock data sets from three stock exchanges (NYSE, NASDAQ, and NSE) are randomly collected and used for the study. Each data set is split into a training set and a test set. Ten-fold cross-validation accuracy is used to evaluate the ML models on the training set. In addition, the ML models are evaluated on the test set using accuracy, precision, recall, F1-score, specificity, and area under the receiver operating characteristic curve (AUC-ROC). Kendall's W test of concordance is used to rank the performance of the tree-based ML algorithms. For the training set, the AdaBoost model performed better than the rest of the models. For the test set, the accuracy, precision, F1-score, and AUC metrics generated results significant enough to rank the models, and the Extra Trees classifier outperformed the other models in all the rankings.


Introduction
Forecasting the future trend and direction of stock price movement is an essential task that helps investors make prudent financial decisions in the stock (equity) market. However, carrying out such a task is very exacting, since the factors that influence the behavior of the equity market (such as political events, economic factors, and investors' sentiments) change constantly at a great pace and are greatly affected by a high degree of noise [1]. For many years, investors and researchers believed that the stock price cannot be predicted. This belief arose from the random walk theory [2][3][4] and the efficient market hypothesis (EMH) [5,6]. Supporters of the random walk theory believe that stock prices move along a random walk path and that any forecast of the movement of a stock will be correct only about fifty percent (50%) of the time [7]. The EMH posits that the equity market reflects all currently available information; hence, it cannot be forecasted to consistently make economic gains that surpass the overall market average. However, numerous research studies have provided evidence to the contrary, showing that the equity market can be forecasted to some extent [8][9][10][11]. Investors have been able to gain profit by beating the market [12]. Investing in equity markets carries considerable risk because financial market time series are non-parametric, dynamic, noisy, and chaotic. To minimize the risk associated with equity market investment, foreknowledge of the future movement of stock prices is needed. A precise forecast of equity market movement is essential in order to maximize capital gain and minimize loss, since investors are likely to buy a stock whose future value is expected to rise and to desist from a stock whose future value is expected to fall. Methods such as technical analysis, fundamental analysis, time series forecasting, and machine learning (ML) exist to forecast the behavior of stock prices.
In this paper, we focus on the use of ML to predict stock price behavior, as it is known from the literature that ML models typically produce better results than statistical and econometric models [13][14][15] and capture the non-linear nature of the equity market better than the other methods. Also, with the availability of huge amounts of stock trading data as a result of advances in computing, the task of predicting the behavior of the equity market is too massive to be carried out with the other methods. Moreover, with ML techniques, individual models can be combined to obtain a reduction in variance and improve the performance of the models [16]. There are two approaches to predicting stock prices with ML models: (a) using a single ML model [9][17][18][19][20][21], and (b) using ensemble ML models [16][22][23][24][25][26][27]. Ensemble ML models have been reported in the literature by some researchers to produce performance superior to that of single ML models [24,28,29]. Hence, in this article, we study tree-based ensemble ML models in an effort to predict the direction of stock price movement. Our goal is to build intelligent tree-based ensemble machine learning models that learn from past stock market data and estimate the direction of stock price movement. Precisely, we examine and compare the effectiveness of the following tree-based ensemble ML models in forecasting the direction of stock price movement: (i) Random Forest Classifier (RF), (ii) XGBoost Classifier (XG), (iii) Bagging Classifier (BC), (iv) AdaBoost Classifier (Ada), (v) Extra Trees Classifier (ET), and (vi) Voting Classifier (VC).

Experimental Design
In this experiment, stock data were acquired through the Yahoo Finance application programming interface (API). Technical indicators that reflect variations in price over time are computed. The data are subjected to two preprocessing steps: (i) data cleaning, to take care of missing and erroneous values; and (ii) data normalization, to bring all features to a common scale. A feature extraction technique is used to extract the relevant features for the machine learning models. The tree-based ensemble machine learning models are then trained and used to make predictions. Finally, the models are evaluated and ranked using different evaluation metrics.

Data and Features
Eight stock data sets are randomly collected from three different stock exchanges (New York Stock Exchange (NYSE), National Association of Securities Dealers Automated Quotations System (NASDAQ), and National Stock Exchange of India Ltd (NSE)). The data sets are BAC, XOM, S&P 500, MSFT, DJIA, KMX, TATASTEEL, and HCLTECH. Table 1 presents a description of the dataset. Forty (40) technical indicators from four different categories (namely, momentum indicators, volume indicators, price transform, and overlap studies) are computed from the collected OHLCV data and used as input features. Details of the overlap studies, volume indicators, price transform, and momentum indicators are provided in Tables A1-A4, respectively, in Appendix A. Each dataset is divided into training and test sets for the purpose of this experiment. The training set constitutes the initial 70% of the data set, and the test set is made up of the final 30% of the data set. Ten-fold cross-validation is used on the training set. In the cross-validation, the training set is split into 10 groups. In each run, nine of the 10 groups are used to train the model and the remaining group is used to evaluate its performance. This process is repeated 10 times, with a different group held out for evaluation in each run.
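The split and cross-validation scheme described above can be sketched with index arithmetic alone. This is a minimal illustration, not the authors' code: the function names `chronological_split` and `k_fold_indices` are hypothetical, the sample count of 1000 is invented, and the use of contiguous, unshuffled folds is an assumption, since the text does not say how the ten groups are formed.

```python
def chronological_split(n_samples, train_percent=70):
    """Return (train_indices, test_indices), keeping the original time order.

    The first `train_percent` percent of the rows form the training set and
    the remaining rows form the test set.
    """
    cut = n_samples * train_percent // 100
    return list(range(cut)), list(range(cut, n_samples))


def k_fold_indices(indices, k=10):
    """Yield (fit_indices, validation_indices) for each of the k folds.

    Assumption: folds are contiguous blocks taken in order, without shuffling.
    """
    n = len(indices)
    # Fold sizes differ by at most one when n is not divisible by k.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = indices[start:stop]
        fit = indices[:start] + indices[stop:]
        yield fit, validation
        start = stop


# Hypothetical data set with 1000 trading days.
train_idx, test_idx = chronological_split(1000)
folds = list(k_fold_indices(train_idx, k=10))
```

Each of the ten `(fit, validation)` pairs trains on nine groups and evaluates on the held-out group; the test indices are never touched during cross-validation.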

Data Normalization
The different features do not have the same range of values. Therefore, we normalized the input dataset to bring the values of all features into the same range. Standardization scaling (z-score) is applied to normalize the feature set. The z-score transforms the features so that the values of each feature have a mean of zero and unit variance [30]:
$$ z_i = \frac{x_i - \mu_i}{\sigma_i} $$
where x_i = value of the ith feature, z_i = standardized value of the ith feature, µ_i = mean of the ith feature, and σ_i = standard deviation of the ith feature.
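As a small worked illustration of the z-score transform, the sketch below standardizes each feature column using only the Python standard library. The helper names `fit_zscore` and `apply_zscore` are hypothetical, the sample rows are invented, and estimating the mean and standard deviation on the training rows only (then reusing them for the test rows) is an assumption made here to avoid information leakage; the text does not state how the statistics were estimated.

```python
from statistics import fmean, pstdev


def fit_zscore(train_rows):
    """Estimate the per-feature mean and (population) standard deviation."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # Guard against constant features, which would give a zero divisor.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Apply z = (x - mean) / std to every value, feature by feature."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Invented two-feature example with very different ranges.
train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
means, stds = fit_zscore(train)
scaled = apply_zscore(train, means, stds)
```

After the transform both columns of `scaled` have mean zero and unit variance, even though the raw features differed by two orders of magnitude.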

Feature Extraction
The final dataset comprises 45 predictors (40 technical indicators and the OHLCV variables). High-dimensional data suffer from the curse of dimensionality, which reduces the performance and accuracy of learning algorithms. Hence, a dimensionality reduction process is essential in the study. However, retaining most of the information offered by the original features is of extreme importance. Principal component analysis (PCA) is applied in this study. PCA has been shown to improve the stability and performance of models in stock prediction [31,32]. PCA operates with the aim of extracting and keeping only the most relevant information of the original dataset. It achieves this aim by employing an orthogonal transformation to transform values of possibly correlated features into values of features that are linearly uncorrelated. These new features are known as principal components (PCs). The PCs are linear combinations of the features of the original dataset; hence, the reconstruction error is greatly reduced. PCA generates orthogonal components, implying that they are not correlated with each other. The first PC is selected so as to minimize the distance between the data and their projection onto the PC. Minimizing this distance is equivalent to maximizing the variance of the projected points. The subsequent PCs are chosen in a similar manner, but with the added requirement that they be uncorrelated with the preceding PCs. In several instances, most of the variance within the dataset is accounted for by the initial few PCs; therefore, the remainder of the PCs can be disregarded with only a minor information loss. Our stock market dataset appears to have many highly correlated features; hence, applying PCA helps us lessen the effect of strong correlations among features while decreasing the dimensionality of the feature space. We retained the PCs that together preserve 95% of the variance (information) of the original data.
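The 95% variance threshold amounts to keeping the smallest number of leading PCs whose cumulative share of the total variance reaches 0.95. The sketch below shows only this selection rule, given the variance (eigenvalue) of each PC; it does not perform the PCA decomposition itself. The function name `components_for_variance` is hypothetical and the eigenvalues are invented for illustration.

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Return the smallest number of leading principal components whose
    cumulative explained-variance ratio is at least `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Invented variances of six principal components (total = 40).
example_eigenvalues = [20.0, 10.0, 6.0, 2.5, 1.0, 0.5]
n_components = components_for_variance(example_eigenvalues)
```

For these invented values the cumulative ratios are 0.50, 0.75, 0.90, and 0.9625, so four components are retained and the last two are discarded.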

Machine Learning Algorithms
The effectiveness of tree-based ML ensemble models (Random Forest classifier, XGBoost classifier, AdaBoost classifier, Bagging classifier, Extra Trees classifier, and Voting classifier) in forecasting the direction of stock price movement is examined in the study. A brief discussion of these ensemble tree-based classifiers is provided here.

Base Classifier
In this study, all the ensemble classifiers use a decision tree classifier as the base classifier. A decision tree (DT) makes predictions on a target variable based on a sequence of rules arranged in a tree-like structure. It comprises non-leaf nodes representing tests on an attribute, branches representing the possible outcomes of a test, and leaf nodes representing class labels. A decision tree classifies a new observation by navigating it down the tree from the root to a leaf node, based on the outcomes of the tests along the path [33]. A DT follows an approach similar to the one human beings generally follow in making decisions. Hence, DT models are intuitive and can be explained easily.

Random Forest Classifier
Random Forest (RF) operates by constructing a group of decision trees to enhance the robustness and performance of the individual decision trees [34]. This method merges the random feature selection technique [35][36][37] and Breiman's bagging sampling method [38] to build a group of decision trees with controlled variation. Employing bagging, each decision tree within the group is created from a sample drawn with replacement from the training data. Each decision tree within the group acts as a base estimator to establish the class label of an unlabeled instance. This is accomplished through majority voting: each base decision tree casts a vote for the class label it predicted, and the class label that gets the majority of the votes is used to classify the instance. RF is robust to noise and overfitting [39]. The random forest algorithm has been applied in several fields by different researchers. Some of the recent applications of the random forest algorithm include random forest for label ranking [40], stock selection with random forest [41], structured random forest for label distribution learning [42], clinical risk prediction with random forests for survival, longitudinal, and multivariate (RF-SLAM) data analysis [43], and the application of random forest-based approaches to surface-enhanced Raman scattering data [44].

AdaBoost Classifier
AdaBoost is a boosting machine learning technique that works by combining multiple weak learners into a single classifier via a weighted linear combination. AdaBoost sequentially applies a learning algorithm to reweighted samples of the original training data [45]. It is an iterative algorithm and, in each iteration, the instances misclassified in the prior iteration are given more weight. Initially, each instance is assigned an equal weight; iteration by iteration, the weights of all wrongly classified instances are raised and those of rightly classified instances are reduced. The algorithm iterates, repeatedly applying the base classifier to the training data with the new weights. The final classification model produced is a linear combination of all the models obtained in the different iterations [46]. AdaBoost fully considers the weight of every classifier; however, it is sensitive to outliers and noisy data. The application of the AdaBoost algorithm in the literature is very diverse. Recent usages of AdaBoost include time series classification based on ARIMA and AdaBoost [47], an AdaBoost algorithmic method for computational financial analysis [48], and an AdaBoost classifier using a stochastic diffusion search model for data optimization in the Internet of Things [49].
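One reweighting iteration can be illustrated with the classic two-class (discrete) AdaBoost update, in which labels are coded as +1 and -1. This is an assumption made for illustration: the text does not specify which AdaBoost variant or implementation was used, and the function name `adaboost_round` and the five-instance example are hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """Perform one discrete AdaBoost reweighting step.

    Returns the weight (alpha) given to the current weak learner and the
    normalized instance weights for the next iteration. Assumes the weighted
    error lies strictly between 0 and 1.
    """
    # Weighted error: total weight of the misclassified instances.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Raise the weights of wrong predictions, lower those of correct ones.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five instances with equal initial weights; only the last is misclassified.
initial_weights = [0.2] * 5
labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_round(initial_weights, labels, predictions)
```

In this example the weighted error is 0.2, the misclassified instance ends up carrying half of the total weight, and each correctly classified instance drops from 0.2 to 0.125, so the next weak learner is pushed toward the instance that was missed.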

XGBoost Classifier
XGBoost is a scalable and efficient variant of the gradient tree boosting algorithm. In the gradient boosting algorithm, boosting is viewed as an optimization problem with the aim of minimizing the loss function of the classification model by adding one weak learner at a time. The algorithm continuously minimizes the errors of the previous models in the direction of the gradient to produce a new model [50]. The XGBoost algorithm incorporates the following features [51]: (a) a regularized model to prevent overfitting; (b) a sparsity-aware split finding algorithm to deal with different kinds of sparsity patterns in the data; (c) a distributed weighted quantile sketch algorithm to deal effectively with weighted data; (d) a column block structure for parallel learning; (e) a cache-aware prefetching algorithm to fetch and store the gradient statistics; and (f) blocks for out-of-core computation. Recent applications of XGBoost in the literature include hard rock pillar stability forecasting using the GBDT, XGBoost, and LightGBM algorithms [52], gene expression value forecasting based on the XGBoost algorithm [53], and the enhancement of the diagnosis of depression using an XGBoost model and a large biomarkers Dutch dataset [54].

Bagging Classifier
A Bagging classifier is an ensemble meta-estimator. This algorithm creates multiple models by fitting each base classifier on a random subsample of the original dataset and then combines the results of all the models to determine the final prediction. The bagging classifier uses either the greatest mean probability among the base classifiers or majority voting to establish the predicted label. Since the original training dataset is re-sampled with replacement, certain instances may be selected many times while others are not selected at all. The meta-estimator reduces the variance of the base estimator by introducing randomization into the construction method and then generating an ensemble from it [55]. The base classifiers are trained in parallel, each on a subset of the training set generated via random selection with replacement from the original dataset. The training dataset of every base classifier is independent of the others. The Bagging algorithm has also seen extensive application in the literature. Some of the recent applications of the bagging algorithm include a comparative application of Bagging and Boosting ensemble machine learning approaches for automated EMG signal classification [56], and an enhancement of the performance of Bagging for the classification of imbalanced datasets with evolutionary multi-objective optimization [57].
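The effect of sampling with replacement can be seen in a short simulation. The sketch below draws one bootstrap sample of row indices and measures how many distinct rows it contains; for large data sets this fraction approaches 1 - 1/e, or about 63%. The function name `bootstrap_sample`, the sample size, and the seed are all hypothetical choices made for illustration.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = bootstrap_sample(1000, rng)

# Some rows appear several times, others never; count the distinct ones.
unique_fraction = len(set(sample)) / len(sample)
```

Each base classifier in the ensemble would be fitted on its own such sample, which is where the variation between the base classifiers comes from.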

Extra Trees Classifier
The Extra-Trees classifier creates a group of unpruned decision trees in accordance with the traditional top-down method. It essentially involves strongly randomizing both attribute and cut-point selection while splitting a node of a tree. In the extreme situation, it creates fully randomized trees whose structures are independent of the output values of the training sample. It differs from other tree-based ensemble methods mainly on two counts: it splits nodes by picking cut-points fully at random, and it uses the whole training sample (instead of a bootstrap replica) to grow the trees. The predictions of all the trees are combined through majority vote to establish the final prediction. The idea behind the Extra-Trees classifier is that the full randomization of the cut-point and attribute, together with ensemble averaging, will decrease variance more than the weaker randomization strategies used by other methods. All of the original training samples are used, instead of bootstrap replicas, in order to decrease bias. Computational efficiency is a major strength of this algorithm [58]. Like the other algorithms, the Extra-Trees algorithm has seen extensive and diverse application in the literature. Some of the recent applications include the classification of land cover using Extremely Randomized Trees [59], and a multi-layer intrusion detection system with Extra Trees feature selection, extreme learning machine ensemble, and softmax aggregation [60].
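The fully randomized extreme mentioned above can be sketched as a single node split in which both the attribute and the cut-point are drawn at random, without looking at the class labels. This is only an illustration of that extreme case under stated assumptions: the function name `random_split`, the invented rows, and the choice of a uniform draw between the attribute's minimum and maximum in the node are not taken from the text.

```python
import random


def random_split(rows, rng):
    """Split a node on a randomly chosen attribute at a random cut-point.

    The cut-point is drawn uniformly between the smallest and largest value
    of the chosen attribute among the rows in the node.
    """
    attribute = rng.randrange(len(rows[0]))
    values = [row[attribute] for row in rows]
    cut_point = rng.uniform(min(values), max(values))
    left = [row for row in rows if row[attribute] < cut_point]
    right = [row for row in rows if row[attribute] >= cut_point]
    return attribute, cut_point, left, right


# Invented node containing four instances with two features each.
node_rows = [[0.1, 5.0], [0.4, 3.0], [0.8, 9.0], [0.9, 1.0]]
attribute, cut_point, left, right = random_split(node_rows, random.Random(1))
```

Because no search over candidate cut-points is performed, a split of this kind is cheap to compute, which is consistent with the computational efficiency noted in the text.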

Voting Classifier
The Voting classifier combines different types of machine learning classifiers, aggregating the output of each classifier passed to it, and makes the final prediction of the class label of a new instance based on voting. The voting may be either hard or soft. In hard voting, simple majority voting is used: the class that gets the greatest number of votes is predicted. In soft voting, a prediction is made by averaging the class probabilities of the classifiers, and the class with the highest average probability is predicted. In this work, we adopted soft voting. The tree-based ensemble classifiers described above are used as the base estimators for the VC.
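The difference between the two voting rules can be shown with a small example in which they disagree. The sketch below is illustrative only: the helper names `hard_vote` and `soft_vote` are hypothetical, and the class probabilities of the three classifiers are invented, with class 0 standing for a downward move and class 1 for an upward move.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(class_probabilities):
    """Average the class probabilities over the classifiers and return the
    class with the highest average, together with the averages."""
    n_classifiers = len(class_probabilities)
    n_classes = len(class_probabilities[0])
    averages = [
        sum(p[c] for p in class_probabilities) / n_classifiers
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: averages[c])
    return best, averages


# Invented [P(down), P(up)] outputs of three base classifiers.
probabilities = [[0.45, 0.55], [0.40, 0.60], [0.90, 0.10]]
labels = [max(range(2), key=lambda c: p[c]) for p in probabilities]

hard_result = hard_vote(labels)
soft_result, averages = soft_vote(probabilities)
```

Two of the three classifiers lean weakly toward class 1, so hard voting predicts class 1, while the third classifier is very confident in class 0, which pulls the averaged probability to about 0.58 and makes soft voting predict class 0.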

Evaluation Metric
To evaluate the performance of the ensemble ML models, the following evaluation criteria are used: (a) accuracy, (b) precision, (c) recall, (d) F1-score, (e) specificity, and (f) area under the receiver operating characteristic curve (AUC-ROC). These metrics are classical quality criteria used to quantify the performance of ML models. Their definitions are given below.
Accuracy: the proportion of all instances rightly predicted by the classifier.
$$ \text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} $$
Precision: the proportion of positive instances rightly predicted by the classifier out of all the instances predicted by the classifier as positive.
$$ \text{Precision} = \frac{tp}{tp + fp} $$
Recall: the proportion of positive instances rightly predicted by the classifier out of all the instances that are actually positive.
$$ \text{Recall} = \frac{tp}{tp + fn} $$
F1 score: the harmonic mean of precision and recall.
$$ F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
Specificity: the proportion of negative instances rightly predicted by the classifier out of all the instances that are actually negative.
$$ \text{Specificity} = \frac{tn}{tn + fp} $$
where tp = true positive, fp = false positive, tn = true negative, and fn = false negative.
AUC-ROC: the ROC is a probability curve that displays graphically the trade-off between recall and specificity. The AUC measures the ability of the classifier to distinguish between the positive and negative classes. A perfect classifier will have an AUC of one, and a classifier that performs no better than random guessing will have an AUC of 0.5.
Table 2 presents the ten-fold cross-validation accuracy scores of the tree-based ensemble ML models on the training set. The scores of Random Forest range from 0.8131 to 0.8884. AdaBoost scores range from 0.8157 to 0.9127. XGBoost scores range from 0.8122 to 0.8991. The accuracy scores of the Bagging classifier range from 0.8034 to 0.8781. The Extra Trees classifier has accuracy scores ranging from 0.8087 to 0.9027. The Voting classifier scores range from 0.8191 to 0.9019. The accuracy score of AdaBoost is the best on the BAC, S&P_500, DJIA, KMX, TATASTEEL, and HCLTECH training data sets. The Extra Trees and Voting classifiers recorded the highest accuracy on the MSFT and XOM training data sets, respectively. In general, the mean accuracy of the AdaBoost model is the best among the tree-based ensemble models. A boxplot of the ten-fold cross-validation accuracy scores of the ML models on the training data sets is presented in Figure 1.
Table 3 shows the accuracy results of the tree-based ensemble ML models on the test data sets. The accuracy results of Random Forest range from 0.7565 to 0.8375. AdaBoost has accuracy results that range from 0.7306 to 0.8702. XGBoost accuracy scores range from 0.7667 to 0.8498. Bagging classifier accuracy results range from 0.7620 to 0.8391. The accuracy results of the Extra Trees classifier range from 0.7889 to 0.8594. Voting classifier accuracy results range from 0.7917 to 0.8552. AdaBoost produced the highest accuracy on the XOM and TATASTEEL data sets. Extra Trees recorded the best accuracy on the DJIA and HCLTECH data sets. The Voting classifier generated the best performance on the MSFT and KMX data sets. The Extra Trees classifier and the Voting classifier produced the same, and highest, accuracy on the BAC and S&P_500 data sets. Overall, the Extra Trees classifier generated the best mean accuracy. Figure 2 provides the boxplot of the accuracy scores of the machine learning models on the test data sets.
Table 4 displays the precision results of the tree-based ensemble ML models on the test data sets. The precision scores of Random Forest range from 0.8033 to 0.9085. AdaBoost precision scores range from 0.8372 to 0.9277. XGBoost has precision scores ranging from 0.8242 to 0.8822. Bagging classifier precision results range from 0.7855 to 0.8841. The precision results of Extra Trees range from 0.8298 to 0.9057. Voting classifier precision scores range from 0.8297 to 0.8934. Random Forest recorded the best precision value on the XOM data set. AdaBoost produced the highest precision values on the S&P_500, MSFT, DJIA, and TATASTEEL data sets. Extra Trees generated the best precision scores on the KMX and HCLTECH data sets. On the whole, AdaBoost recorded the highest mean precision score. Figure 3 presents the boxplot of the precision results of the tree-based ensemble machine learning models on the test data sets.
Table 5 presents the recall results of the tree-based ensemble ML models on the test data sets. The recall results of Random Forest range from 0.6622 to 0.9089. AdaBoost has recall scores that range from 0.5731 to 0.8750. The recall results of XGBoost range from 0.6857 to 0.8940. The Bagging classifier has recall scores ranging from 0.7286 to 0.8962. Extra Trees recall scores range from 0.7008 to 0.9089. The Voting classifier has recall results ranging from 0.7176 to 0.8983. Random Forest recorded the best recall value on the TATASTEEL data set. AdaBoost produced the highest recall scores on the BAC and XOM data sets. XGBoost generated the best recall values on the KMX and HCLTECH data sets. The recall results generated by Bagging are the best on the S&P_500 and KMX data sets. Extra Trees recorded the highest recall values on DJIA and TATASTEEL. In general, the Voting classifier produced the highest mean recall score. A boxplot illustrating the recall scores of the ensemble machine learning models on the test data sets is given in Figure 4.
Table 6 shows the F1 scores of the tree-based ensemble ML models on the test data sets. The F1 scores of Random Forest range from 0.7498 to 0.8529. AdaBoost F1 results range from 0.7009 to 0.8722. XGBoost has F1 scores that range from 0.7640 to 0.8577. The F1 scores of the Bagging classifier range from 0.7803 to 0.8494. Extra Trees F1 scores range from 0.7853 to 0.8675. The Voting classifier has F1 scores ranging from 0.7915 to 0.8627. AdaBoost generated the best F1 results on the XOM and TATASTEEL data sets. The Extra Trees classifier generated F1 results superior to the rest of the models on the BAC, DJIA, and HCLTECH data sets. The F1 values recorded by the Voting classifier are the highest on the XOM, MSFT, and KMX data sets. Overall, the Extra Trees classifier has the highest average F1 score. A boxplot of the F1 scores of the tree-based ensemble machine learning models on the test data sets is shown in Figure 5.
Table 7 presents the specificity of the tree-based ensemble ML models on the test data sets. The specificity results of Random Forest range from 0.7717 to 0.9189. AdaBoost specificity scores range from 0.8194 to 0.9366. XGBoost has specificity scores ranging from 0.8043 to 0.8887. The specificity results of the Bagging classifier range from 0.7381 to 0.8981. Extra Trees specificity scores range from 0.8087 to 0.9132. The Voting classifier has specificity scores that range from 0.8109 to 0.9019. Random Forest generated the best specificity score on the XOM data set. AdaBoost produced the highest specificity results on the S&P_500, MSFT, DJIA, and TATASTEEL data sets. The Bagging classifier recorded the best specificity on the BAC data set. The Extra Trees classifier generated the best specificity results on the KMX and HCLTECH data sets. In general, the AdaBoost classifier has the highest mean specificity score. Figure 6 presents the boxplot of the specificity scores of the tree-based ensemble machine learning models on the test data sets.
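The evaluation metrics defined at the start of this section can all be computed from the four confusion-matrix counts, and the AUC can be computed from classifier scores as the fraction of positive-negative pairs that are ranked correctly. The sketch below is a minimal illustration: the function names `classification_metrics` and `auc_from_scores` are hypothetical, the counts and scores are invented and are not results from the tables, and no guard against empty classes (zero denominators) is included.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the threshold-based metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


def auc_from_scores(positive_scores, negative_scores):
    """AUC as the probability that a randomly chosen positive instance gets
    a higher score than a randomly chosen negative one (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented counts: 80 true positives, 20 false positives,
# 70 true negatives, 30 false negatives.
metrics = classification_metrics(tp=80, fp=20, tn=70, fn=30)

# Invented scores for three positive and two negative instances.
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3])
```

With these invented counts the accuracy is 0.75 and the precision is 0.80, while recall (about 0.73) and specificity (about 0.78) differ, which illustrates why the models are compared on several metrics rather than on accuracy alone.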
Figure 7 provides the boxplot of the AUC measure of the treebased ensemble machine learning models on the test data sets.  The ROC curve of all the tree-based ensemble ML on BAC, XOM, S&P 500, MSFT, DJIA, KMX, TATASTEEL, and HCLTECH stock datasets are displayed by Figures 8-15 respectively. This presents a model with good separability and ROC curve passing close to the upper left corner.   Figure 7 provides the boxplot of the AUC measure of the tree-based ensemble machine learning models on the test data sets.    Figure 7 provides the boxplot of the AUC measure of the treebased ensemble machine learning models on the test data sets.  The ROC curve of all the tree-based ensemble ML on BAC, XOM, S&P 500, MSFT, DJIA, KMX, TATASTEEL, and HCLTECH stock datasets are displayed by Figures 8-15 respectively. This presents a model with good separability and ROC curve passing close to the upper left corner. The ROC curve of all the tree-based ensemble ML on BAC, XOM, S&P 500, MSFT, DJIA, KMX, TATASTEEL, and HCLTECH stock datasets are displayed by Figures 8-15 respectively. This presents a model with good separability and ROC curve passing close to the upper left corner.                     A quantitative procedure utilizing Kendall W test of concordance is applied to rank the effectiveness of the tree-based ML algorithms in predicting the direction of stock price movement. In the study, we used a cut-off value of 0.05 for the significance level (p-value). We considered the Kendall's coefficient significant and capable of giving an overall ranking when p < 0.05. The critical value for chi-square ( 2  ) at p = 0.05 for five (5) degrees of freedom is 11.071. The degrees of freedom is equal to the total number of ML algorithms minus one. In this work, six ML algorithms are used giving us 5 degrees of freedom. Table 9 shows the results of Kendall's W tests in using accuracy of the ten-fold cross validation on the training data sets. 
The outcomes of Kendall's W tests in using accuracy, precision, recall, F1-measure, specificity, and AUC metrics on the test data sets are displayed by Tables 10-15 below respectively. Analysis of Table 9 indicates that Kendall's coefficient using the accuracy of the ten-fold crossvalidation on the training data set is significant (p < 0.05, 2  > 11.071). The performance of AdaBoost classifier is superior the rest of the ensemble models. The overall ranking is AdaBoost >VC > ET > XGBoost > RF > BC.  A quantitative procedure utilizing Kendall W test of concordance is applied to rank the effectiveness of the tree-based ML algorithms in predicting the direction of stock price movement. In the study, we used a cut-off value of 0.05 for the significance level (p-value). We considered the Kendall's coefficient significant and capable of giving an overall ranking when p < 0.05. The critical value for chi-square ( 2  ) at p = 0.05 for five (5) degrees of freedom is 11.071. The degrees of freedom is equal to the total number of ML algorithms minus one. In this work, six ML algorithms are used giving us 5 degrees of freedom. Table 9 shows the results of Kendall's W tests in using accuracy of the ten-fold cross validation on the training data sets. The outcomes of Kendall's W tests in using accuracy, precision, recall, F1-measure, specificity, and AUC metrics on the test data sets are displayed by Tables 10-15 below respectively. Analysis of Table 9 indicates that Kendall's coefficient using the accuracy of the ten-fold crossvalidation on the training data set is significant (p < 0.05, 2  > 11.071). The performance of AdaBoost classifier is superior the rest of the ensemble models. The overall ranking is AdaBoost >VC > ET > XGBoost > RF > BC. A quantitative procedure utilizing Kendall W test of concordance is applied to rank the effectiveness of the tree-based ML algorithms in predicting the direction of stock price movement. 
In the study, we used a cut-off value of 0.05 for the significance level (p-value). We considered Kendall's coefficient significant, and capable of giving an overall ranking, when p < 0.05. The critical value of chi-square (χ²) at p = 0.05 for five (5) degrees of freedom is 11.071. The number of degrees of freedom equals the number of ML algorithms minus one; with six ML algorithms, this gives 5 degrees of freedom. Table 9 shows the results of Kendall's W test using the ten-fold cross-validation accuracy on the training data sets. The outcomes of Kendall's W tests using the accuracy, precision, recall, F1-score, specificity, and AUC metrics on the test data sets are displayed in Tables 10-15, respectively.
Analysis of Table 9 indicates that Kendall's coefficient using the ten-fold cross-validation accuracy on the training data sets is significant (p < 0.05, χ² > 11.071). The performance of the AdaBoost classifier is superior to that of the rest of the ensemble models. The overall ranking is AdaBoost > VC > ET > XGBoost > RF > BC. Analysis of Table 10 shows that Kendall's coefficient using the accuracy measure on the test data sets is significant (p < 0.05, χ² > 11.071). The performance of the Extra Trees classifier is superior to that of the rest of the ensemble models. The overall ranking is ET > VC > AdaBoost > XGBoost > RF > BC. Table 11 indicates that Kendall's coefficient using the precision metric on the test data sets is significant (p < 0.05, χ² > 11.071). The Extra Trees classifier is the foremost performer among the ensemble ML models. The overall ranking is ET > AdaBoost > VC > RF > BC > XGBoost. Analysis of Table 12 shows that Kendall's coefficient using the recall metric on the test data sets is not significant (p > 0.05, χ² < 11.071); hence, this measure cannot be used to rank the ensemble ML models.
Table 13 indicates that Kendall's coefficient using the F1-score on the test data sets is significant (p < 0.05, χ² > 11.071) and that the performance of the Extra Trees classifier surpasses that of the other ensemble ML models. The overall ranking is ET > VC > AdaBoost = XGBoost > RF > BC. Analysis of Table 14 demonstrates that Kendall's coefficient using the specificity metric on the test data sets is not significant (p > 0.05, χ² < 11.071); hence, this measure cannot be used to rank the ensemble ML models. Table 15 shows that Kendall's coefficient using the AUC metric on the test data sets is significant (p < 0.05, χ² > 11.071) and that the Extra Trees classifier is ranked the highest among the ensemble ML models. The overall ranking is ET > VC > XGBoost > AdaBoost > RF > BC.
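
The ranking procedure above can be illustrated with a minimal sketch in plain Python. It assumes the textbook definition of Kendall's W with average ranks and a tie correction, and the usual chi-square approximation χ² = m(n − 1)W for m data sets and n models. The function names and the score values are hypothetical and are not taken from Tables 9-15; the study's own computation may differ in its details.

```python
# Illustrative sketch only: function names and scores are hypothetical.

CRITICAL_CHI2_DF5 = 11.071  # chi-square critical value at p = 0.05, 5 d.f. (from the text)


def average_ranks(scores):
    """Rank scores so the highest gets rank 1; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def kendall_w(score_table):
    """score_table[j][i] is the metric value of model i on data set j.

    Returns (W, chi_square, rank_sums). Not defined when every row is fully
    tied, because the tie-corrected denominator is then zero.
    """
    m = len(score_table)          # number of data sets (the "judges")
    n = len(score_table[0])       # number of models being ranked
    rank_rows = [average_ranks(row) for row in score_table]
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    ties = 0.0                    # sum of (t^3 - t) over groups of tied ranks
    for row in rank_rows:
        counts = {}
        for r in row:
            counts[r] = counts.get(r, 0) + 1
        ties += sum(t ** 3 - t for t in counts.values())
    w = 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)
    chi_square = m * (n - 1) * w
    return w, chi_square, rank_sums


models = ["RF", "XG", "BC", "Ada", "ET", "VC"]
# Hypothetical test accuracies: one row per data set, one column per model.
scores = [
    [0.81, 0.83, 0.80, 0.84, 0.88, 0.86],
    [0.79, 0.82, 0.78, 0.83, 0.87, 0.85],
    [0.80, 0.81, 0.79, 0.85, 0.86, 0.84],
    [0.82, 0.80, 0.78, 0.83, 0.89, 0.87],
]
w, chi_square, rank_sums = kendall_w(scores)
ranking = [name for _, name in sorted(zip(rank_sums, models))]
print("W =", round(w, 3), "chi-square =", round(chi_square, 3))
print("significant:", chi_square > CRITICAL_CHI2_DF5)
print("overall ranking (best first):", " > ".join(ranking))
```

A smaller rank sum means a better overall position, so sorting the models by rank sum gives the overall ranking whenever the test is significant.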

Conclusions
This paper evaluated and compared the effectiveness of six different tree-based ensemble machine learning algorithms in predicting the direction of stock price movement. Stock data were randomly collected from three different stock exchanges. Each data set was split into a training set and a test set. The models were evaluated with ten-fold cross-validation accuracy on the training set. In addition, the models were evaluated on the test set using the accuracy, precision, recall, F1-score, specificity, and AUC metrics. The Kendall W test of concordance was adopted to rank the effectiveness of the tree-based ML algorithms. The experimental results indicated that, for the ten-fold cross-validation accuracy on the training set, the AdaBoost model outperformed the other models. For the test data, only the accuracy, precision, F1-score, and AUC metrics generated results significant enough to rank the models using the Kendall W test of concordance. The Extra Trees model ranked first under each of these four metrics on the test data sets. A limitation of this study is that it considered only tree-based ensemble models. Hence, in our future work, we will incorporate machine learning models based on Gaussian processes, regularization techniques, and kernel-based techniques.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A

Table A1. Description of Overlap Studies Indicators used in the study.

Overlap Studies Indicators Description
Bollinger Bands (BBANDS) Describes the different highs and lows of a financial instrument in a particular duration.
Weighted Moving Average (WMA) Moving average that assigns a greater weight to more recent data points than to past data points.
Exponential Moving Average (EMA) Weighted moving average that puts greater weight and importance on current data points; however, the rate of decrease between a price and its preceding price is not consistent.
Double Exponential Moving Average (DEMA) It is based on EMA and attempts to provide a smoothed average with less lag than EMA.
Kaufman Adaptive Moving Average (KAMA) Moving average designed to be responsive to market trends and volatility.

Table A2. Description of Volume Indicators used in the study.

Volume Indicator Description
Chaikin A/D Line (ADL) Accumulation/distribution line that estimates the cumulative flow of money into or out of a security from its price and volume.

Momentum Indicators Description
Average Used to estimate whether a security is overbought or oversold. It measures RSI over its own high/low range over a specified period.
Ultimate Oscillator (ULTOSC) Estimates the price momentum of a security asset across different time frames.
Williams' %R (WILLR) Indicates the position of the last closing price relative to the highest and lowest price over a time period.
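
To make some of the indicator descriptions in this appendix concrete, the following is a minimal sketch in plain Python of WMA, EMA, DEMA, and Williams' %R under their common textbook definitions. The function names are hypothetical, the EMA is seeded with the first price (an assumption; libraries often seed with a simple moving average instead), and the sketch is not necessarily the implementation used in the study.

```python
# Illustrative sketch only: textbook definitions, hypothetical function names.

def wma(prices, period):
    """Weighted moving average: weights 1..period, the newest price weighted most."""
    weights = range(1, period + 1)
    denom = period * (period + 1) / 2
    return [
        sum(w * p for w, p in zip(weights, prices[i - period + 1:i + 1])) / denom
        for i in range(period - 1, len(prices))
    ]


def ema(prices, period):
    """Exponential moving average with smoothing factor 2 / (period + 1).

    Assumption: the series is seeded with the first price.
    """
    alpha = 2 / (period + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out


def dema(prices, period):
    """Double EMA: 2 * EMA(prices) - EMA(EMA(prices)), which reduces lag."""
    first = ema(prices, period)
    second = ema(first, period)
    return [2 * a - b for a, b in zip(first, second)]


def williams_r(highs, lows, closes, period):
    """Williams' %R: position of the close within the high-low range, from -100 to 0."""
    out = []
    for i in range(period - 1, len(closes)):
        highest = max(highs[i - period + 1:i + 1])
        lowest = min(lows[i - period + 1:i + 1])
        if highest == lowest:
            out.append(0.0)  # assumption: flat window treated as 0 to avoid dividing by zero
        else:
            out.append(-100 * (highest - closes[i]) / (highest - lowest))
    return out


# Hypothetical price data for demonstration.
highs = [10.0, 12.0, 11.0, 13.0]
lows = [8.0, 9.0, 9.0, 10.0]
closes = [9.0, 11.0, 10.0, 12.0]
print("WMA(3):", wma(closes, 3))
print("EMA(3):", ema(closes, 3))
print("DEMA(3):", dema(closes, 3))
print("WILLR(3):", williams_r(highs, lows, closes, 3))
```

A %R value near 0 means the close is near the top of the recent range, and a value near -100 means it is near the bottom.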