5.2. Data Description and Preparation
The Nepal Stock Exchange (NEPSE) serves as the primary stock market in Nepal, facilitating the trading of securities and promoting investment opportunities within the country. Established in 1993 under the Securities Exchange Act, NEPSE plays a crucial role in fostering economic growth and development by providing a platform for companies to raise capital and for investors to participate in the financial markets. As the sole stock exchange in Nepal, NEPSE oversees the listing and trading of various financial instruments, including stocks, bonds, and mutual funds, contributing to the country's financial infrastructure and market stability (Shrestha and Lamichhane 2021, 2022).
The close prices of the NEPSE index and the financial news headlines from 1 January 2019 to 31 December 2023 were scraped from the Share Sansar portal (www.sharesansar.com) using a web crawler. Share Sansar is a prominent online platform in Nepal dedicated to stock market news and analysis, providing a wealth of data, including stock prices, market trends, company financials, and news articles.
This study aimed to perform a comparative analysis of the performance of machine learning models in predicting the next day's movement direction of the NEPSE index.
Figure 4 illustrates the original time series of the index closing prices from 1 January 2019 to 31 December 2023. The time span covers both bear and bull markets. The vertical dotted line in the graph marks the maximum close price, which occurred on 9 September 2021, during the COVID-19 pandemic in Nepal. The time span was selected intentionally to train the models to capture the patterns in both normal and volatile market situations and then to test their performance in similar environments. We defined the direction, a dichotomous variable, using the following equation:

$$\mathrm{Direction}_t = \begin{cases} 1\ (\text{upward}), & \text{if } C_{t+1} > C_t,\\ 0\ (\text{downward}), & \text{otherwise,} \end{cases}$$

where $C_t$ represents the closing price of the current day and $C_{t+1}$ represents the closing price of the following day. Over the five years from the beginning of 2019 through the end of 2023, there were 551 instances of the NEPSE moving upward (positive direction) and 584 instances of it moving downward (negative direction).
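For illustration, the direction variable can be derived from a series of closing prices with a single shift operation; the following minimal pandas sketch is not the authors' code, and the frame and column names are assumptions.

```python
import pandas as pd

# Hypothetical frame of daily NEPSE closes in chronological order.
nepse = pd.DataFrame({"close": [2100.5, 2112.3, 2098.7, 2105.0]})

# direction_t = 1 if the next day's close exceeds today's, else 0.
nepse["direction"] = (nepse["close"].shift(-1) > nepse["close"]).astype(int)

# The final row has no "next day", so its direction is undefined; drop it.
nepse = nepse.iloc[:-1]
print(nepse)
```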
Before computing sentiment scores, raw text data should undergo preprocessing steps such as removing stop words, handling punctuation, and stemming or lemmatization to enhance the accuracy and effectiveness of sentiment analysis algorithms. We cleaned the raw text data (news headlines) by sequentially applying a set of preprocessing functions from the NLTK library (Bird 2006), including stop-word removal based on stopwords.words('english').
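The exact cleaning pipeline is not reproduced here; the following is a minimal sketch of a typical NLTK headline-cleaning routine of the kind described above, in which every choice beyond stopwords.words('english') is an illustrative assumption rather than the authors' code.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources.
nltk.download("punkt")      # newer NLTK versions may also need "punkt_tab"
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_headline(text: str) -> str:
    """Lowercase, remove digits and punctuation, drop stop words, lemmatize."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # strip digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)

print(clean_headline("NEPSE gains 25 points as banking stocks rally!"))
```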
The text data were then clean and ready for sentiment analysis. Valence Aware Dictionary and Sentiment Reasoning (VADER) (Hutto and Gilbert 2014) and TextBlob (Loria 2018) are among the most common and reliable tools for computing sentiment scores (Aljedaani et al. 2022; Al-Natour and Turetken 2020; Bonta et al. 2019). VADER is a pretrained model that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions via computational treatment of polarity (positive/negative) and intensity (strength) in text. It relies on an English dictionary that maps lexical features to their semantic orientation as positive or negative (Liu 2010). TextBlob is a powerful and user-friendly Python library for natural language processing (NLP). Built on the NLTK (Bird 2006) and Pattern libraries, TextBlob simplifies the complexities of working with textual data by providing a simple API for common NLP tasks. It offers a range of functionalities, including part-of-speech tagging, noun-phrase extraction, sentiment analysis, classification, and translation. TextBlob's strength lies in its ease of use, making it a good choice for beginners and researchers alike. With its straightforward syntax and extensive documentation, TextBlob enables developers to quickly implement NLP features in their applications without extensive background knowledge in linguistics or machine learning.
Both tools (VADER and TextBlob) have their strengths and weaknesses, and the choice between them depends on factors such as the specific use case, the type of text data being analyzed, and the desired level of accuracy and flexibility. After preprocessing, the data were fed into TextBlob and VADER to determine their corresponding sentiment scores. We labeled the sentiment score from TextBlob as TextBlob sentiment data and the sentiment score from VADER as VADER sentiment data. This labeling will help readers easily identify and understand the source of the data.
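As a minimal sketch (not the authors' exact code), both scores can be obtained per headline as follows; TextBlob's polarity and VADER's compound score both lie in [−1, 1].

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

headline = "nepse index surges as banking stocks rally"

# TextBlob polarity: a float in [-1, 1].
textblob_score = TextBlob(headline).sentiment.polarity

# VADER compound score: a normalized, weighted composite in [-1, 1].
vader_score = SentimentIntensityAnalyzer().polarity_scores(headline)["compound"]

print(f"TextBlob: {textblob_score:+.3f}  VADER: {vader_score:+.3f}")
```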
Multiple news headlines were typically published on a single day, so we computed the average sentiment score for each day. The NEPSE runs from Sunday to Thursday, but financial news headlines are published seven days a week. We believe that news on Thursday and the weekend has an impact on the stock price on Sunday (the first trading day of the next week). Therefore, we computed the mean sentiment score over Thursday, Friday, and Saturday and used it to predict the movement direction of the NEPSE index for Sunday. Finally, the sentiment scores for the financial news data were aligned with the movement direction of the index by date, as shown in Figure 5.
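One plausible reading of this alignment is sketched below in pandas: each headline date is mapped to the next trading day, so Thursday, Friday, and Saturday scores all roll onto the following Sunday and are averaged there. The frame and column names are assumptions, not the authors' code.

```python
import pandas as pd

# Hypothetical per-headline sentiment scores (Thu 5th through Sun 8th).
news = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-06",
                            "2023-01-07", "2023-01-08"]),
    "score": [0.40, -0.10, 0.20, 0.30],
})

# Average all headlines published on the same calendar day.
daily = news.groupby("date", as_index=False)["score"].mean()

def next_trading_day(d: pd.Timestamp) -> pd.Timestamp:
    """Map a publication date to the next NEPSE trading day (Sun-Thu).
    Thursday, Friday, and Saturday news all land on the following Sunday."""
    nxt = d + pd.Timedelta(days=1)
    while nxt.dayofweek in (4, 5):  # pandas: Friday = 4, Saturday = 5
        nxt += pd.Timedelta(days=1)
    return nxt

daily["trading_date"] = daily["date"].apply(next_trading_day)

# The mean Thu-Sat score becomes the predictor for Sunday's direction.
aligned = daily.groupby("trading_date", as_index=False)["score"].mean()
print(aligned)
```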
5.4. Hyperparameter Tuning
A machine learning model is a mathematical model with several parameters that are learned from the data during the training process. However, some parameters, known as hyperparameters, cannot be learned directly. These hyperparameters play a crucial role in shaping the performance and behavior of machine learning models. Researchers commonly choose hyperparameters based on intuition, previous studies, trial and error, or cross-validation before the actual training begins. These parameters govern important properties of the model, such as its complexity or its learning rate. Models can have many hyperparameters, and finding their best combination can be treated as a search problem.
The standard LR model has no hyperparameters. An SVM, on the other hand, is equipped with three hyperparameters: kernel, gamma, and C. The kernel defines the type of decision boundary the SVM will use, while gamma controls the SVM's flexibility and is crucial for avoiding overfitting. The regularization parameter C governs the trade-off between a smooth decision boundary and accurately classifying the training data: higher C values encourage the SVM to fit the training data more closely, while lower values promote a smoother decision boundary that generalizes better to new data. Balancing these hyperparameters is essential for optimizing SVM performance across different datasets and tasks.
The hyperparameters of the SVM were tuned using the grid search algorithm of sklearn (Hao and Ho 2019). In the initial phase, we experimented with four possible kernels: linear, sigmoid, radial basis function (RBF), and polynomial (degree 3). For each kernel, we varied C from 1 to 51 in increments of 1 and gamma from 0.01 to 1 in increments of 0.01. For each combination of kernel, C, and gamma, the SVM was fitted using ten-fold cross-validation, and the results are presented in Table 3 and Table 4 for TextBlob sentiment data and VADER sentiment data, respectively. According to these tables, the optimal hyperparameters for TextBlob sentiment data are an RBF kernel, a C value of 1.0, and a gamma value of 0.5; for VADER sentiment data, they are a sigmoid kernel, a C value of 13, and a gamma value of 0.83. For readability, these outcomes are highlighted within the tables.
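A minimal sketch of this search with scikit-learn's GridSearchCV is shown below; the grid mirrors the ranges stated above, while X_train and y_train (the sentiment features and direction labels) are assumed from the preceding steps.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "kernel": ["linear", "sigmoid", "rbf", "poly"],  # poly defaults to degree 3
    "C": np.arange(1, 52, 1),
    "gamma": np.arange(0.01, 1.01, 0.01),
}
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)  # X_train, y_train assumed
print(search.best_params_, search.best_score_)
```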
Similar to the SVM, the random forest (RF) is equipped with hyperparameters. Three key hyperparameters in RF are n_estimators, max_depth, and max_leaf_nodes. The n_estimators parameter determines the number of trees in the forest; a higher value generally leads to a more robust model but increases computational cost. The max_depth parameter controls the maximum depth of each decision tree in the ensemble, regulating the complexity of individual trees: a deeper tree can capture more complex relationships in the data but may lead to overfitting. Finally, max_leaf_nodes limits the total number of terminal nodes, or leaves, in each tree, helping to prevent overfitting by restricting the granularity of the tree structure. Tuning these hyperparameters allows practitioners to strike a balance between model complexity and generalization, tailoring the RF for optimal performance on diverse datasets.
The hyperparameters of the RF were also tuned using the grid search algorithm of sklearn (Hao and Ho 2019). We varied n_estimators from 1 to 101 in increments of 5, and max_depth and max_leaf_nodes from 1 to 21 in increments of 2. For each combination of these hyperparameters, the RF was fitted using ten-fold cross-validation, and the results are presented in Table 5. Based on these results, the optimal hyperparameters for TextBlob sentiment data are max_depth = 3, max_leaf_nodes = 3, and n_estimators = 71; for VADER sentiment data, they are max_depth = 11, max_leaf_nodes = 11, and n_estimators = 6.
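An analogous sketch for the RF search follows; note that scikit-learn requires max_leaf_nodes to exceed 1, so the illustrative grid starts that range at 3 (consistent with the reported optima). X_train and y_train are again assumed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": list(range(1, 102, 5)),
    "max_depth": list(range(1, 22, 2)),
    "max_leaf_nodes": list(range(3, 22, 2)),  # sklearn requires values > 1
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=10, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)  # X_train, y_train assumed
print(search.best_params_)
```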
XGBoost, an efficient and scalable implementation of gradient boosting, boasts several critical hyperparameters that significantly impact its performance. The n_estimators hyperparameter determines the number of boosting rounds or trees to be built, affecting the overall complexity and predictive power of the model. The max_depth parameter controls the maximum depth of each tree, regulating the model’s capacity and potential for overfitting. The learning_rate, often denoted as eta, scales the contribution of each tree, influencing the step size during optimization. A smaller learning rate typically requires more boosting rounds but can enhance generalization. Lastly, the min_child_weight hyperparameter sets the minimum sum of instance weight needed in a child, impacting tree partitioning and addressing imbalances in the data. These hyperparameters collectively play a crucial role in fine-tuning the XGBoost algorithm for optimal performance, balancing model complexity and generalization across diverse datasets.
The hyperparameters of XGBoost were also tuned using the grid search algorithm of sklearn (Hao and Ho 2019). In the initial phase, we experimented with four possible eta values: 0.0001, 0.001, 0.01, and 0.1. For each eta, we varied n_estimators from 1 to 101 in increments of 5, max_depth from 1 to 21 in increments of 2, and min_child_weight from 1 to 10 in increments of 2. For each combination of these four hyperparameters, XGBoost was fitted using ten-fold cross-validation, and the results are presented in Table 6. According to these results, the optimal hyperparameters for TextBlob sentiment data are max_depth = 3, eta = 0.1, n_estimators = 21, and min_child_weight = 9; for VADER sentiment data, they are max_depth = 13, eta = 0.1, n_estimators = 36, and min_child_weight = 5.
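The XGBoost search can be sketched in the same style; in the xgboost Python package, eta is exposed as learning_rate. X_train and y_train are assumed, and the constructor arguments are illustrative rather than the authors' exact settings.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],  # eta
    "n_estimators": list(range(1, 102, 5)),
    "max_depth": list(range(1, 22, 2)),
    "min_child_weight": list(range(1, 11, 2)),
}
search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                      param_grid, cv=10, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)  # X_train, y_train assumed
print(search.best_params_)
```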
Hyperparameters play a crucial role in shaping the performance and behavior of a neural network model. The number of neurons in the hidden layer determines the model’s capacity to capture complex patterns in the data, with larger numbers potentially enabling better representation but also increasing the risk of overfitting. The learning rate influences the size of steps taken during the optimization process, affecting the convergence speed and stability of the model. Batch size determines the number of data points used in each iteration of training, impacting the efficiency and generalization ability of the network. The choice of optimizer, such as stochastic gradient descent (SGD) or Adam, affects how the model adjusts its weights to minimize the loss function. Finding the optimal combination of these hyperparameters is often a challenging task, requiring careful tuning to achieve optimal model performance.
Due to the small size of the data, a single-hidden-layer NN model was used. The remaining hyperparameters of the network were tuned using the grid search algorithm of sklearn (Hao and Ho 2019). In the initial phase, we constructed all possible combinations of hyperparameters: five optimizers (adam, rmsprop, adagrad, nadam, and adadelta), three learning rates (0.1, 0.01, and 0.001), four batch sizes (8, 16, 32, and 64), and seven possible numbers of neurons in the hidden layer (1, 5, 10, 15, 20, 25, and 50). Therefore, 5 × 3 × 4 × 7 = 420 combinations were available for each model. We then constructed a neural network architecture with an input layer and a hidden layer with ReLU activation, followed by a fully connected output layer with sigmoid activation. This network was fitted with all 420 combinations of the hyperparameters using ten-fold cross-validation, and the results are presented in Table 7. Based on these results, the optimal hyperparameters for TextBlob sentiment data are # neurons = 50, eta = 0.1, optimizer = adam, and batch_size = 8; for VADER sentiment data, they are # neurons = 25, eta = 0.001, optimizer = nadam, and batch_size = 8.
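A minimal Keras sketch of the architecture being tuned is given below; the builder exposes the four searched hyperparameters, while n_features and the wiring into a ten-fold cross-validated grid search (e.g., via a scikit-learn wrapper) are assumptions left out for brevity.

```python
from tensorflow import keras

# Optimizer name -> class, covering the five optimizers in the search grid.
OPTIMIZERS = {
    "adam": keras.optimizers.Adam,
    "rmsprop": keras.optimizers.RMSprop,
    "adagrad": keras.optimizers.Adagrad,
    "nadam": keras.optimizers.Nadam,
    "adadelta": keras.optimizers.Adadelta,
}

def build_model(n_neurons: int, eta: float,
                optimizer: str, n_features: int) -> keras.Model:
    """Single hidden ReLU layer followed by a sigmoid output unit."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(n_neurons, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=OPTIMIZERS[optimizer](learning_rate=eta),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Example: the configuration reported as optimal for TextBlob sentiment data.
# model = build_model(n_neurons=50, eta=0.1, optimizer="adam",
#                     n_features=X_train.shape[1])
```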
5.5. Model Comparison
The models LR, SVM, RF, and XGBoost were tested on the test dataset with their respective optimized hyperparameters from Section 5.4 in place. The performance scores achieved by the models are presented in Table 8 for TextBlob sentiment data and Table 9 for VADER sentiment data.
Once the hyperparameters of the NN model were tuned, the models were set with their corresponding optimized hyperparameters from Section 5.4. The number of epochs was set to 500, and an early stopping criterion was implemented to address the underfitting and overfitting that can occur when training neural networks; it allowed us to specify a large number of epochs and stop training when the model's performance stopped improving on the validation data (Brownlee 2018). Furthermore, a 10% dropout was applied between the hidden and fully connected output layers to prevent overfitting. Binary cross-entropy was used as the loss function during training. Figure 6 illustrates how the loss evolved with the number of epochs during ANN training. The model overfitted after 300 epochs on TextBlob sentiment data, whereas there was no further improvement after 250 epochs on VADER sentiment data.
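A sketch of this training setup in Keras follows; the patience value and the validation split are assumptions, and a 10% Dropout layer would sit between the hidden and output layers of the model built earlier.

```python
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # stop when validation loss stops improving
    patience=20,                 # the patience value is an assumption
    restore_best_weights=True,
)

# Mirror of the setup described above: up to 500 epochs, binary
# cross-entropy loss, early stopping on validation performance.
# history = model.fit(X_train, y_train, epochs=500, batch_size=8,
#                     validation_split=0.1, callbacks=[early_stop])
```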
To address the stochastic behavior of the ANNs, we replicated each of the models 50 times, and the corresponding metrics were computed on the test dataset. The average metrics, together with their respective standard deviations, are presented in Table 8 for TextBlob sentiment data and Table 9 for VADER sentiment data.
The standard deviations of 0.0202, 0.0135, 0.0050, and 0.0032 for sensitivity, specificity, accuracy, and AUC in Table 8 are comparatively very small, indicating that the NN model is consistent on the TextBlob sentiment data. On the other hand, the standard deviations of 0.3727 and 0.3028 for sensitivity and specificity in Table 9 are comparatively large, indicating that the NN model is inconsistent on the VADER sentiment data.
The sensitivity, specificity, accuracy, and AUC of the models LR, SVM, RF, XGBoost, and ANN are provided in Table 8 for TextBlob sentiment data and in Table 9 for VADER sentiment data. These metrics are further visualized by the multiple bar plots in Figure 7.
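For reference, the four reported metrics can be computed from a fitted classifier's test-set output as in the following sketch; y_test, y_pred (class labels), and y_prob (predicted probabilities) are assumed.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_test, y_pred, y_prob) -> dict:
    """Sensitivity, specificity, accuracy, and AUC for a binary classifier."""
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),  # true-positive rate (upward moves)
        "specificity": tn / (tn + fp),  # true-negative rate (downward moves)
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "auc": roc_auc_score(y_test, y_prob),
    }
```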
The sensitivity values for the TextBlob sentiment data (0.2857 for LR, 0.2041 for SVM, 0.3469 for RF, 0.3163 for XGBoost, and 0.2806 for ANN) represent the proportion of actual positive cases (NEPSE moving upward) that each model correctly predicted as positive. A higher sensitivity indicates a better ability to detect positive instances. In this comparison, the RF model had the highest sensitivity of 0.3469, followed by XGBoost with 0.3163; LR and ANN had sensitivities of 0.2857 and 0.2806, respectively, and SVM showed the lowest sensitivity of 0.2041. For the VADER sentiment data, the LR model had the highest sensitivity of 0.9184, followed by XGBoost with 0.5918; RF and SVM had sensitivities of 0.5612 and 0.3163, respectively, while ANN showed the lowest sensitivity of 0.3143. Sensitivity is a crucial metric for evaluating classification models, especially in scenarios where correctly identifying positive cases is important.
The specificity values for the TextBlob sentiment data (0.9535 for LR, 1.0000 for SVM, 0.9147 for RF, 0.9302 for XGBoost, and 0.9603 for ANN) represent the proportion of actual negative cases (NEPSE moving downward) that each model correctly predicted as negative. A higher specificity indicates a better ability to detect negative instances. In this comparison, the SVM model had the highest specificity of 1.0, indicating perfect performance in identifying negative cases. The ANN followed closely with a specificity of 0.9603, while LR, RF, and XGBoost had specificities of 0.9535, 0.9147, and 0.9302, respectively. For the VADER sentiment data, the ANN model had the highest specificity of 0.7414, followed by SVM with 0.6279; XGBoost and RF had specificities of 0.4574 and 0.4264, respectively, while LR showed the lowest specificity of 0.3143. Specificity is an important metric in scenarios where correctly identifying negative cases is crucial.
The accuracy values for the TextBlob sentiment data (0.6652 for LR, 0.6564 for SVM, 0.6696 for RF, 0.6652 for XGBoost, and 0.6669 for ANN) represent the overall correctness of the predictions made by each model across both positive (NEPSE moving upward) and negative (NEPSE moving downward) cases. In this comparison, the RF model had the highest accuracy of 0.6696, followed closely by the ANN with 0.6669; LR and XGBoost both had accuracies of 0.6652, while SVM showed the lowest accuracy of 0.6564. For the VADER sentiment data, the ANN model had the highest accuracy of 0.5572, followed by LR with 0.5286; XGBoost and SVM had accuracies of 0.5154 and 0.4934, respectively, while RF showed the lowest accuracy of 0.4846.
For the TextBlob sentiment data, the AUC values (0.6682 for LR, 0.6682 for SVM, 0.6772 for RF, 0.6942 for XGBoost, and 0.6682 for ANN) show negligible differences between LR, SVM, and ANN, which all share the same score. However, RF and XGBoost stood out with higher values: XGBoost led with an AUC of 0.6942, indicating superior discriminative ability, while RF followed closely with 0.6772. For the VADER sentiment data, the ANN and LR models had the highest AUC of 0.5800, followed by XGBoost with 0.5122; RF had an AUC of 0.4977, while SVM showed the lowest at 0.4816. The AUC is a critical metric for evaluating the discriminative power of classification models in binary classification tasks.
Comparing the ROC curves for TextBlob sentiment data depicted in Figure 8 (left) for the machine learning models LR, SVM, RF, XGBoost, and ANN with their associated AUC values offers valuable insights into their discriminatory capabilities. The ROC curves visually portray the balance between the true-positive rate (sensitivity) and the false-positive rate (1 − specificity) across varying threshold settings. Models with higher AUC values typically showcase ROC curves that approach the top-left corner, indicating superior discrimination between positive and negative classes. In this assessment, XGBoost emerges as the standout performer with the highest AUC of 0.6942, corresponding to the ROC curve closest to the ideal top-left corner. RF follows closely with an AUC of 0.6772, signaling slightly lesser but still effective discriminatory performance. LR, SVM, and ANN exhibit identical AUC values of 0.6682, reflected in ROC curves that lie closer to the diagonal line. The ROC curves for VADER sentiment data in Figure 8 (right) further elucidate the models' discriminatory power: LR and ANN show moderate discrimination with AUC values of 0.58, while SVM and RF demonstrate weaker performance with AUC values of 0.4816 and 0.4977, respectively. XGBoost falls between these extremes with an AUC of 0.5122.
Moreover, although all five models exhibit greater sensitivity when applied to VADER sentiment data, they excel in terms of specificity, accuracy, and AUC when operating on TextBlob sentiment data. This suggests that the overall performance of all five models is superior when analyzing TextBlob sentiment data as opposed to VADER sentiment data.
5.6. Statistical Analysis
To evaluate the reliability of the models' outcomes, a statistical analysis was conducted to determine whether there were significant performance differences among the five models. This analysis followed a hypothesis testing approach for two populations, as recommended in the literature (Agresti and Caffo 2000; Dahal and Amezziane 2020; Wald and Wolfowitz 1940). The Wald test, renowned for its simplicity and widespread use, was utilized to compare population proportions between the models. The test detects meaningful differences in proportions between groups or populations, with statistical significance determined through a p-value derived from the standard normal distribution. In this study, the Wald method's validity was contingent upon a sufficiently large sample size (number of observations in the test data = 227), ensuring that the product of the sample size, the accuracy percentage, and its complement is at least 10. The null hypothesis ($H_0$) posits that the models' prediction accuracies are not significantly different, while the alternative hypothesis ($H_1$) suggests the opposite. The results of multiple pairwise comparisons using the Wald test, along with the corresponding p-values, are presented in Table 10 and Table 11 for TextBlob sentiment data and in Table 12 and Table 13 for VADER sentiment data.
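For concreteness, one such pairwise comparison can be sketched as follows; the function name is illustrative, and the example inputs are the Table 8 accuracies of RF and SVM with n = 227 test observations for each model.

```python
import math

from scipy.stats import norm

def wald_two_proportions(p1: float, p2: float, n1: int, n2: int) -> float:
    """Two-sided p-value for H0: p1 == p2 using the Wald statistic."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    return 2 * (1 - norm.cdf(abs(z)))

# Example: RF vs. SVM accuracy on TextBlob sentiment data (Table 8).
print(wald_two_proportions(0.6696, 0.6564, 227, 227))  # ~0.77, not significant
```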
Examining Table 10 and Table 11, we find no evidence to reject the null hypothesis, as none of the p-values fall below the 0.05 threshold. This indicates insufficient evidence to claim significant differences in model performance; put simply, it is reasonable to conclude that all five models exhibit similar accuracy and AUC on TextBlob sentiment data. Similar conclusions are drawn from Table 12, suggesting statistical similarity in accuracy across the models on VADER sentiment data. However, Table 13 presents mixed results: specifically, LR versus SVM and SVM versus ANN show p-values below 0.05. This signifies statistically significant differences in AUC performance between these pairs on VADER sentiment data, while the remaining pairs exhibit similar performance.