Evaluating the Efficacy of Large Language Models in Stock Market Decision-Making: A Decision-Focused, Price-Only, Multi-Country Analysis Using Historical Price Data
Abstract
1. Introduction
- It expands the scope of LLM assessment beyond language-centric tasks to structured numerical input, establishing a framework for decision-focused evaluation.
- This study examines LLM behavior under structured numerical input and frames evaluation around standardized decision labels.
- To the best of our knowledge, this is among the first large-scale, cross-country evaluations of LLMs on structured, price-only time series data, focused explicitly on generating decision labels such as buy, sell, or hold.
- It highlights the opportunities and limitations of using LLMs under a common prompt-based evaluation setting, offering insights into their comparative behavior in domains traditionally dominated by statistical and deep learning models.
2. Literature Review
2.1. Classical and Deep Learning-Based Feature Engineering in Financial Forecasting
2.2. LLMs in Financial Forecasting
2.3. Multi-Modal and Hybrid LLMs in Financial Forecasting
2.4. Retrieval-Augmented Generation (RAG) and LLMs in Financial Forecasting
2.5. Research Gaps and Questions
- RQ1: To what extent can LLMs (GPT-4.0, Gemini 2.0 Flash, LLaMA-4-Scout-17B-16E) generate accurate decision labels for short-term (1-month) and medium-term (3-month) investment outcomes when provided only with historical closing price data?
- RQ2: How does decision performance vary with contextual factors such as stock volatility, industry sector, and country of listing?
- RQ3: Which LLM demonstrates the greatest consistency and relative accuracy across different evaluation settings?
- RQ4: How does the type of investment question—buy, sell/hold, or pairwise comparison—affect model accuracy and reliability?
3. Materials and Methods
3.1. Stock Identification and Closing-Price Collection
3.2. Prompts
- (a) Buy Decision: Should the investor purchase a particular stock, given its price history? (Evaluated for both the 1-month and 3-month horizons.)
- (b) Sell/Hold Decision: Should the investor sell a currently held stock or continue to hold it? (Same investment horizons as the buy decision.)
- (c) Comparison Decision: Between two candidate stocks, which is the better investment choice over a specified horizon? (Same investment horizons as the buy decision.)
- System Prompt: You are a financial analyst.
- Sample Buy Prompt: Based on the daily closing price of stock [STOCK NAME] over the past [N] months (provided in a CSV file), should this stock be bought today for a [M]-month investment horizon? Please respond with Buy or Not to Buy.
- Sample Sell/Hold Prompt: Given the daily closing prices of stock [STOCK NAME] over the last [N] months (provided in a CSV file), and assuming the stock is currently held in the portfolio, should it be sold at the end of the next [M] months or held beyond that period? Please respond with Sell or Hold.
- Sample Comparison Prompt (Buy): Given the daily closing prices of two stocks, [STOCK A] and [STOCK B], over the last [N] months (provided in two CSV files), which stock is more suitable to be bought today for a [M]-month investment horizon? Respond as None of them, Stock A, or Stock B.
- Sample Comparison Prompt (Sell/Hold): Given the daily closing prices of two stocks, [STOCK A] and [STOCK B], over the last [N] months (provided in two CSV files), which stock is more suitable to be sold or held today for a [M]-month investment horizon? Respond as None of them, Stock A, or Stock B.
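The prompt construction and the restriction to a fixed answer set can be sketched in Python as follows. This is a minimal illustration, not the study's actual scripts: the names `BUY_TEMPLATE`, `ALLOWED_LABELS`, `build_buy_prompt`, and `normalize_label` are hypothetical, the 36-month default follows the appendix prompts, and the label sets follow the response formats described for the buy, sell/hold, and comparison tasks.

```python
# Template adapted from the sample buy prompt; placeholder names are illustrative.
BUY_TEMPLATE = (
    "Based on the daily closing price of stock {stock_name} over the past "
    "{n_months} months (provided in a CSV file), should this stock be bought "
    "today for a {horizon_months}-month investment horizon? "
    "Please respond with Buy or Not to Buy."
)

# Allowed answers per task type (assumed canonical spellings).
ALLOWED_LABELS = {
    "buy": ("Buy", "Not to Buy"),
    "sell": ("Sell", "Hold"),
    "compare": ("Stock A", "Stock B", "None"),
}


def build_buy_prompt(stock_name, csv_text, n_months=36, horizon_months=1):
    """Fill the buy template and append the price history as CSV text."""
    question = BUY_TEMPLATE.format(
        stock_name=stock_name,
        n_months=n_months,
        horizon_months=horizon_months,
    )
    return f"{question}\nStock Data:\n{csv_text}"


def normalize_label(raw, task):
    """Map a raw model reply to a canonical label, or None if it is off-format."""
    cleaned = raw.strip().strip("\"'.!").strip().lower()
    for label in ALLOWED_LABELS[task]:
        if cleaned == label.lower():
            return label
    return None
```

Returning `None` for off-format replies keeps invalid outputs visible, so they can be counted or re-queried rather than silently scored.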
3.3. Methodology
- Step 1:
- Identify Stocks and LLMs: A representative set of 150 stocks was selected from three countries—India, the United States, and South Africa. These stocks span ten major economic sectors, ensuring coverage of diverse market sizes, sectoral dynamics, and volatility levels. Three state-of-the-art LLMs were chosen for evaluation—OpenAI’s GPT-4.0, Google’s Gemini 2.0 Flash, and Meta’s LLaMA-4-Scout-17B-16E—providing a range of architectures and training paradigms.
- Step 2:
- Annotate with Sector and Volatility: Each stock was annotated with its corresponding industry sector following the Global Industry Classification Standard (GICS). To further capture the individual risk characteristics, we calculated each stock’s historical annualized volatility based on the dispersion of daily returns over the year preceding the prediction window.

  Daily returns were computed from closing prices using natural log returns for better statistical properties (especially for compounding) as follows:

  $$r_t = \ln\left(\frac{P_t}{P_{t-1}}\right),$$

  where $P_t$ is the closing price at day $t$ and $P_{t-1}$ is the closing price at day $t-1$.

  The daily variability (sample standard deviation) of the daily returns was calculated as

  $$\sigma_{d} = \sqrt{\frac{1}{N-1}\sum_{t=1}^{N}\left(r_t - \bar{r}\right)^2},$$

  where $N$ is the number of trading days in the one-year window and $\bar{r}$ is the sample mean of the daily returns over that window.

  The annualized volatility was then computed from the variability of daily returns using the conventional square-root-of-time rule, assuming 252 trading days per year:

  $$\sigma_{a} = \sigma_{d}\sqrt{252}.$$

  Stocks were then classified into three volatility categories: low (0–5%), medium (5–10%), and high (>10%). These thresholds are used as practical within-study grouping cutoffs and should not be interpreted as universal market-wide definitions of low, medium, and high volatility. This method is also in line with the earlier literature, where relative volatility categorization is employed to study the differential behavior of returns for different volatility groups of stocks [49,50,51].

  This structured classification enabled comparative evaluation across both sectoral dynamics and volatility levels, thereby facilitating a deeper understanding of whether the model’s predictive accuracy and decision behavior varied across stable, moderately volatile, and highly volatile market environments.
- Step 3:
- Retrieve Historical Price Data via API: Daily closing prices were collected for all stocks from 1 March 2022 to 28 March 2025. Data sources included Yahoo Finance (US stocks), NSE India (Indian stocks), and the Johannesburg Stock Exchange (JSE) platform (South African stocks). Automated Python scripts were implemented to query these APIs and save each stock’s time series in a dedicated CSV file.
- Step 4:
- Prepare Prompts with CSV Data: For each stock, historical closing prices were formatted into structured CSV files that served as inputs for the LLMs. The prompts were designed to reflect three primary categories of investment decision-making tasks: (i) buy, (ii) sell/hold, and (iii) pairwise comparison. Each prompt specified the investment horizon (1 month or 3 months). To isolate the effect of numerical data, no market commentary or sentiment was included.
- Step 5:
- Query LLM APIs and Collect Predictions: Prompts were sent to the respective LLM APIs using Python scripts. Identical prompts were given to all models under controlled conditions. The temperature setting was fixed at zero to reduce random variation. The model outputs were not generated as free-form text; instead, they were limited to a specific set of responses, based on clear format rules included in the prompt design. For buy-side evaluations, each model was instructed to respond strictly with either “Buy” or “Not to Buy.” For sell-side tasks, the allowed responses were “Sell” or “Hold.” In pairwise comparison tasks, answers were restricted to “Stock A,” “Stock B,” or “None.”
- Step 6:
- Compare Predictions with Actual Outcomes: Model predictions were evaluated against realized stock price movements using explicitly defined forward-return rules. For each stock and prediction date $t$, the forward return over an investment horizon $T$ (either 1 month or 3 months) was computed from close-to-close prices as

  $$R_{t,T} = \frac{P_{t+T} - P_t}{P_t},$$

  where $P_t$ is the closing price on the prediction date and $P_{t+T}$ is the closing price at the end of the horizon. This definition is used consistently for evaluating buy/sell outcomes.

  To check the robustness of the results, we carried out a sensitivity check over alternative values of the decision threshold $\tau$, shown in Table 2. As expected, the accuracy of the models changes with the threshold applied, because the strictness of the labels and the class distribution change. The ranking of the models remains the same for all tested thresholds: GPT-4.0 > LLaMA-4-Scout-17B-16E > Gemini 2.0 Flash. These results suggest that the comparative ordering of the models is stable across the tested thresholds, although absolute performance depends on the threshold choice. The intermediate threshold of $\tau = 2.5\%$ is used as the default, as it offers better and more stable performance across models, without the noise sensitivity of the lower threshold or the reduced signal frequency of the higher threshold.

  The threshold $\tau = 2.5\%$ is used as a practical decision filter to exclude economically insignificant price movements and avoid labeling noise-driven fluctuations as actionable signals. In short-horizon settings, small price changes are often dominated by market microstructure effects and may not reflect meaningful investment opportunities. Therefore, the threshold is interpreted as a minimum return hurdle rather than a structural parameter.

  Ground-truth labels were derived using the decision threshold $\tau$, as summarized below.
- (a) Single-stock tasks
  - Buy task:
  - Sell task:
- (b) Pairwise (comparison) tasks: Let $R_1$ and $R_2$ denote the forward returns of Stock 1 and Stock 2, respectively.
  - Comparison Buy:
  - Comparison Sell:
Each model prediction was scored as a binary outcome (1 = correct, 0 = incorrect) by comparing the predicted action with the corresponding ground-truth label derived from these rules. This explicit evaluation framework supports transparency and reproducibility in evaluating model decisions across all task types. Because fixed thresholds generate different event frequencies across contexts, the resulting labels should be interpreted as evaluation constructs within this protocol rather than as universally comparable trading events.
- Step 7:
- Analyze Accuracy by Sector, Volatility, and Model: Accuracy scores were aggregated and analyzed across dimensions such as sector, volatility group, investment horizon, and model type. This enabled within-model and cross-model comparisons. Additional analyses assessed stability and consistency of performance across financial environments.
- Step 8:
- Compile Benchmark Report: The results were consolidated into a benchmark report, which included overall accuracy scores, sector- and volatility-based breakdowns, and comparative performance across investment question types. This benchmark provides a structured reference point for comparing LLM behavior across markets, sectors, and prompt types.
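The evaluation in Steps 6 and 7 can be illustrated with a short Python sketch: compute a close-to-close forward return, derive a ground-truth label, score the prediction as 0 or 1, and aggregate accuracy by group. The function names are hypothetical, the horizon is expressed as an offset in trading days, and the buy-labeling rule shown (a return at or above the 2.5% hurdle counts as "Buy") is an assumption made for illustration, not a restatement of the study's exact rules.

```python
from collections import defaultdict


def forward_return(prices, t, horizon_days):
    """Simple close-to-close return from index t to t + horizon_days."""
    return (prices[t + horizon_days] - prices[t]) / prices[t]


def buy_label(ret, tau=0.025):
    """ASSUMED rule for illustration: 'Buy' if the return clears the hurdle tau."""
    return "Buy" if ret >= tau else "Not to Buy"


def score(predicted, truth):
    """Binary scoring: 1 if the predicted label matches the ground truth, else 0."""
    return 1 if predicted == truth else 0


def accuracy_by_group(records):
    """Aggregate accuracy per group (e.g., sector, volatility class, or model).

    records is an iterable of (group, predicted_label, truth_label) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, predicted, truth in records:
        hits[group] += score(predicted, truth)
        totals[group] += 1
    return {group: hits[group] / totals[group] for group in totals}
```

The same `accuracy_by_group` helper can be reused with a different grouping key (sector, country, horizon, or question type) to produce each of the breakdowns described in Step 7.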
3.4. Dataset Preparation
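| Algorithm 1 Volatility Estimation from Daily Closing Prices | |
|---|---|
| 1: | Input: daily closing prices $P_1, \dots, P_n$; trading days per year $D$ (default $D = 252$); optional rolling window size $w$ (days); volatility thresholds $(\theta_{\text{low}}, \theta_{\text{high}})$ (default $\theta_{\text{low}} = 5\%$, $\theta_{\text{high}} = 10\%$) |
| 2: | Output: daily volatility $\sigma_d$, annual volatility $\sigma_a$, volatility percentage $V$, volatility class $C$ |
| 3: | Sort prices in chronological order |
| 4: | Remove missing or invalid price entries |
| 5: | $n \leftarrow$ number of remaining prices |
| 6: | if $n < 3$ then |
| 7: | return error (insufficient data) |
| 8: | end if |
| 9: | if rolling window $w$ is provided then |
| 10: | for each window segment of length $w$ (sliding by 1 day) do |
| 11: | Apply steps 16–25 to the prices inside the window |
| 12: | Record the windowed outputs |
| 13: | end for |
| 14: | return windowed $\sigma_d$, $\sigma_a$, $V$, $C$ |
| 15: | end if |
| 16: | for $t = 2$ to $n$ do |
| 17: | Compute log-return: $r_t \leftarrow \ln(P_t / P_{t-1})$ |
| 18: | end for |
| 19: | Number of returns: $m \leftarrow n - 1$ |
| 20: | Compute mean return: $\bar{r} \leftarrow \frac{1}{m}\sum_{t=2}^{n} r_t$ |
| 21: | Compute sample variance: $s^2 \leftarrow \frac{1}{m-1}\sum_{t=2}^{n}(r_t - \bar{r})^2$ |
| 22: | Daily volatility: $\sigma_d \leftarrow \sqrt{s^2}$ |
| 23: | Annual volatility: $\sigma_a \leftarrow \sigma_d\sqrt{D}$ |
| 24: | Volatility percentage: $V \leftarrow 100 \times \sigma_a$ |
| 25: | Classify volatility: $C \leftarrow$ low if $V \le \theta_{\text{low}}$; medium if $\theta_{\text{low}} < V \le \theta_{\text{high}}$; high if $V > \theta_{\text{high}}$ |
| 26: | return $\sigma_d$, $\sigma_a$, $V$, $C$ |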
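Algorithm 1 can be sketched in plain Python as shown below. This is one possible reading of the procedure rather than the authors' code: the function names are hypothetical, prices are assumed to be already sorted chronologically, at least three valid prices are required so that the sample variance is defined, and the handling of values exactly on the 5% and 10% boundaries is an assumption.

```python
import math


def classify_volatility(vol_pct, low=5.0, high=10.0):
    """Map an annualized volatility percentage to low / medium / high."""
    if vol_pct <= low:       # boundary handling is an assumption
        return "low"
    if vol_pct <= high:
        return "medium"
    return "high"


def estimate_volatility(prices, trading_days=252, low=5.0, high=10.0):
    """Return (daily_vol, annual_vol, vol_pct, vol_class) from closing prices.

    Assumes prices are in chronological order.
    """
    # Drop missing or invalid entries (None, NaN, infinite, non-positive).
    clean = [
        p for p in prices
        if isinstance(p, (int, float)) and math.isfinite(p) and p > 0
    ]
    if len(clean) < 3:
        raise ValueError("insufficient data: need at least 3 valid prices")

    # Natural log returns between consecutive closes.
    returns = [math.log(clean[i] / clean[i - 1]) for i in range(1, len(clean))]
    m = len(returns)
    mean_return = sum(returns) / m

    # Sample variance with the (m - 1) denominator.
    variance = sum((r - mean_return) ** 2 for r in returns) / (m - 1)
    daily_vol = math.sqrt(variance)

    # Square-root-of-time annualization.
    annual_vol = daily_vol * math.sqrt(trading_days)
    vol_pct = 100.0 * annual_vol
    return daily_vol, annual_vol, vol_pct, classify_volatility(vol_pct, low, high)


def rolling_volatility(prices, window, **kwargs):
    """Apply estimate_volatility to each window of the given length, sliding by 1."""
    return [
        estimate_volatility(prices[i:i + window], **kwargs)
        for i in range(len(prices) - window + 1)
    ]
```

For a one-year lookback, `prices` would hold roughly 252 closes; `rolling_volatility` corresponds to the optional windowed branch of the algorithm.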
3.5. Experimental Setup
3.6. Evaluation Metric
4. Results
4.1. RQ1: How Accurately Can LLMs Generate Investment Decisions for Short-Term (1-Month) and Medium-Term (3-Month) Horizons Using Only Historical Closing Price Data?
4.2. RQ2: How Does Decision Performance Vary with Contextual Factors Such as Stock Volatility, Industry Sector, and Country of Listing?
4.2.1. Volatility-Level Analysis
4.2.2. Country-Level Analysis
4.2.3. Sector-Level Analysis
4.2.4. Impact of Volatility, Sector, and Country
4.3. RQ3: Are Certain LLMs (e.g., GPT-4.0 vs. Gemini 2.0 Flash vs. LLaMA-4-Scout-17B-16E) More Consistent in Their Decision Quality Across Different Financial Environments?
4.3.1. Model Comparison: GPT-4.0, Gemini 2.0 Flash, and LLaMA-4-Scout-17B-16E
4.3.2. Statistical Robustness
4.4. RQ4: How Does the Type of Investment Question (Buy, Sell/Hold, Pairwise Comparison) Affect Model Performance and Reliability?
4.5. Sectoral Implications for Financial Theory
- Consumer Staples: Model performance in this sector was relatively stable (60–65%). The Capital Asset Pricing Model (CAPM) offers a theoretical explanation for this stability.
- Information Technology: Model performance in this sector was highly unstable, with accuracy ranging from 0 to 100%. The theory of ‘creative destruction’ [59] offers an explanation for this instability.
- Healthcare: Model performance in this sector was strong, with accuracy often above 70%. A theoretical explanation for this performance is given by the authors of [60].
- Financials: Model performance in this sector was moderate, with accuracy ranging from 45 to 55%. A theoretical explanation for this performance is given by the author of [61].
- Real Estate: Accuracy was high but unstable (60–100%). This matches theory: the sector is defensive and has a degree of asset backing, but it is also sensitive to interest rate effects.
- Communication Services: Performance was weakest here, with accuracy as low as 8–10%. The sector is sentiment-driven and dependent on intangible assets, as expected in a speculative asset class [64].
Effect of Investment Question Type
5. Conclusions
5.1. RQ1: Accuracy of LLMs in Short-Term and Medium-Term Forecasts
5.2. RQ2: Influence of Contextual Factors (Volatility, Sector, and Country)
5.3. RQ3: Consistency and Accuracy Across Models
5.4. RQ4: Impact of Investment Question Type
5.5. Practical Implications
5.6. Limitations and Future Directions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Prompts and Datasets
Appendix A.1. Prompts—United States
- Prompt: Based on the daily closing price of stock {stock_name} over the past 36 months, should this stock be bought at the current price for a 3-month investment horizon? Answer with buy or not to buy. Stock Data: {stock_info_text} {format_instructions}
- Prompt: Based on the daily closing price of stock {stock_name} over the past 36 months, should this stock be bought at the current price for a 1-month investment horizon? Answer with buy or not to buy. Stock Data: {stock_info_text} {format_instructions}
- Prompt: Given the daily closing prices of stock {stock_name} over the last 36 months, and assuming the stock is currently held, should it be sold at the end of the next 3 months or held beyond that period? Please respond with "Sell" or "Hold". Stock Data: {stock_info_text} {format_instructions}
- Prompt: Given the daily closing prices of stock {stock_name} over the last 36 months, and assuming the stock is currently held, should it be sold at the end of the next 1 months or held beyond that period? Please respond with "Sell" or "Hold". Stock Data: {stock_info_text} {format_instructions}
- Date,Close
  01-03-2022,160.3874
  02-03-2022,163.6895
  03-03-2022,163.3652
Appendix A.2. Prompts—South Africa (JSE)
- Prompt: Based on the daily closing price of stock {stock_name} over the past 36 months, should this stock be bought at the current price for a 3-month investment horizon? Answer with buy or not to buy. Stock Data: {stock_info_text} {format_instructions}
- Prompt: Based on the daily closing price of stock {stock_name} over the past 36 months, should this stock be bought at the current price for a 1-month investment horizon? Answer with buy or not to buy. Stock Data: {stock_info_text} {format_instructions}
- Prompt: Given the daily closing prices of stock {stock_name} over the last 36 months, and assuming the stock is currently held, should it be sold at the end of the next 3 months or held beyond that period? Please respond with "Sell" or "Hold". Stock Data: {stock_info_text} {format_instructions}
- OpenAI/Gemini 2.0 Flash/LLaMA-4-Scout-17B-16E 1-month Sell (JSE)
- Date,Close
  2022-03-01 00:00:00+02:00,16,692.45
  2022-03-02 00:00:00+02:00,16,702.43
  2022-03-03 00:00:00+02:00,16,947.83
Appendix A.3. Prompts—India
- Prompt: Based on the daily closing price of stock {stock_name} over the past 36 months, should this stock be bought at the current price for a 3-month investment horizon? Answer with buy or not to buy. Stock Data: {stock_info_text} {format_instructions}
- Prompt: Based on the daily closing price of stock {stock_name} over the past 36 months, should this stock be bought at the current price for a 1-month investment horizon? Answer with buy or not to buy. Stock Data: {stock_info_text} {format_instructions}
- Prompt: Given the daily closing prices of stock {stock_name} over the last 36 months, and assuming the stock is currently held, should it be sold at the end of the next 3 months or held beyond that period? Please respond with "Sell" or "Hold". Stock Data: {stock_info_text} {format_instructions}
- Prompt: Given the daily closing prices of stock {stock_name} over the last 36 months, and assuming the stock is currently held, should it be sold at the end of the next 1 months or held beyond that period? Please respond with "Sell" or "Hold". Stock Data: {stock_info_text} {format_instructions}
- Date,Close
  2022-03-02,1374.55
  2022-03-03,1370.25
  2022-03-04,1366.6
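The sample extracts in Appendices A.1 to A.3 use three different date layouts, and the JSE prices appear with a comma inside the number. A small Python sketch of how such rows could be normalized before further processing is given below. The function name `parse_price_row` is hypothetical, and two assumptions are made: the US dates are day-first (consistent with a series starting on 1 March 2022), and any comma after the first field is a thousands separator.

```python
from datetime import datetime

# Candidate date layouts seen in the sample extracts (day-first is ASSUMED for US rows).
DATE_FORMATS = ("%d-%m-%Y", "%Y-%m-%d %H:%M:%S%z", "%Y-%m-%d")


def parse_price_row(line):
    """Parse one 'date,close' row into (datetime.date, float)."""
    date_part, _, price_part = line.strip().partition(",")
    # ASSUMPTION: remaining commas are thousands separators, e.g. '16,692.45'.
    price = float(price_part.replace(",", ""))
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(date_part, fmt).date(), price
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {date_part!r}")
```

Normalizing all three markets to the same `(date, close)` form before prompt construction would make the inputs directly comparable across countries.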
Appendix A.4. Pairwise Comparison Prompts (US/JSE/India)
- Given the daily closing prices of two stocks-STOCK A and STOCK B-over the last 36 months, which stock is more suitable to be bought today for a 1-month investment horizon? Please respond with the stock name. Possible outputs: {stock_a_name}, {stock_b_name}, None. STOCK A ({stock_a_name}) Data: {stock_a_data} STOCK B ({stock_b_name}) Data: {stock_b_data} {format_instructions}
- Given the daily closing prices of two stocks-STOCK A and STOCK B-over the last 36 months, which stock is more suitable to be bought today for a 3-month investment horizon? Please respond with the stock name. Possible outputs: {stock_a_name}, {stock_b_name}, None. STOCK A ({stock_a_name}) Data: {stock_a_data} STOCK B ({stock_b_name}) Data: {stock_b_data} {format_instructions}
- Given the daily closing prices of two stocks-STOCK A and STOCK B-over the last 36 months, and assuming both stocks are currently held, which stock is more suitable to be sold at the end of the next 3 months? Please respond with the stock name. Possible outputs: {stock_a_name}, {stock_b_name}, None, Both. STOCK A ({stock_a_name}) Data: {stock_a_data} STOCK B ({stock_b_name}) Data: {stock_b_data} {format_instructions}
- Given the daily closing prices of two stocks-STOCK A and STOCK B-over the last 36 months, and assuming both stocks are currently held, which stock is more suitable to be sold at the end of the next 1 month? Please respond with the stock name. Possible outputs: {stock_a_name}, {stock_b_name}, None, Both. STOCK A ({stock_a_name}) Data: {stock_a_data} STOCK B ({stock_b_name}) Data: {stock_b_data} {format_instructions}
References
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
- Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805.
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
- Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control; John Wiley & Sons: Hoboken, NJ, USA, 2015.
- Gupta, M.; Wei, C.; Czerniawski, T.; Eiris, R. PIDQA - Question Answering on Piping and Instrumentation Diagrams. Mach. Learn. Knowl. Extr. 2025, 7, 39.
- Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115.
- Trust, P.; Minghim, R. A Study on Text Classification in the Age of Large Language Models. Mach. Learn. Knowl. Extr. 2024, 6, 2688–2721.
- Guo, X.; Hu, A.; Santamaria, M.; Tajrobehkar, M.; Zhang, J. MFGLib: A library for mean-field games. arXiv 2023, arXiv:2304.08630.
- Alenezy, A.H.; Ismail, M.T.; Wadi, S.A.; Jaber, J.J. Predicting stock market volatility using MODWT with HyFIS and FS.HGD models. Risks 2023, 11, 121.
- Truong, L.D.; Friday, H.S.; Ngo, T.M. Market Reaction to Delisting Announcements in Frontier Markets: Evidence from the Vietnam Stock Market. Risks 2023, 11, 201.
- Apau, R.; Sibindi, A.; Jeke, L. Effect of macroeconomic dynamics on bank asset quality under different market conditions: Evidence from Ghana. Risks 2023, 11, 158.
- Sadorsky, P.; Henriques, I. Using US Stock Sectors to Diversify, Hedge, and Provide Safe Havens for NFT Coins. Risks 2023, 11, 119.
- McClellan, M. AI and financial fragility: A framework for measuring systemic risk in deployment of generative AI for stock price predictions. J. Risk Financ. Manag. 2025, 18, 475.
- Jin, Y.; Zhao, H.; Zhang, Q.; Xue, Y. Time-LLM: Time Series Forecasting by Reprogramming Large Language Models. arXiv 2024, arXiv:2310.01728.
- Wang, R.; Chen, L.; Li, X. StockTime: A Time Series Specialized LLM Architecture for Stock Price Prediction. arXiv 2024, arXiv:2409.08281.
- Yu, W.; Liu, H.; Liu, Z.; Zhang, Y. Temporal Data Meets LLM: Explainable Financial Time Series Forecasting. arXiv 2023, arXiv:2306.11025.
- Pei, J.; Zhang, Y.; Liu, T.; Yang, J.; Wu, Q.; Qin, K. ADTime: Adaptive Multivariate Time Series Forecasting Using LLMs. Mach. Learn. Knowl. Extr. 2025, 7, 35.
- Hassaan, Z.A.; Yacoub, M.H.; Said, L.A. FPGA-Accelerated ESN with Chaos Training for Financial Time Series Prediction. Mach. Learn. Knowl. Extr. 2025, 7, 160.
- Engle, R.F. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econom. J. Econom. Soc. 1982, 50, 987–1007.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Lopez-Lira, A.; Tang, Y. Can ChatGPT forecast stock price movements? Return predictability and large language models. arXiv 2023, arXiv:2304.07619.
- Lo, A.W.; Mamaysky, H.; Wang, J. Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation. J. Financ. 2000, 55, 1705–1765.
- Mirashk, H.; Albadvi, A.; Kargari, M.; Rastegar, M. News Sentiment and Liquidity Risk Forecasting: Insights from Iranian Banks. Risks 2024, 12, 171.
- Siami-Namini, S.; Namin, A.S. Forecasting economics and financial time series: ARIMA vs. LSTM. arXiv 2018, arXiv:1803.06386.
- Khan, S.; Alghulaiakh, H. ARIMA Model for Accurate Time Series Stocks Forecasting. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 524–528.
- Zhao, P.; Zhu, H.; Ng, W.S.H.; Lee, D.L. From GARCH to neural network for volatility forecast. Proc. AAAI Conf. Artif. Intell. 2024, 38, 16998–17006.
- Ampountolas, A. Enhancing forecasting accuracy in commodity and financial markets: Insights from GARCH and SVR models. Int. J. Financ. Stud. 2024, 12, 59.
- Saâdaoui, F.; Rabbouch, H. Financial forecasting improvement with LSTM-ARFIMA hybrid models and non-Gaussian distributions. Technol. Forecast. Soc. Change 2024, 206, 123539.
- Wang, J.; Hong, S.; Dong, Y.; Li, Z.; Hu, J. Predicting stock market trends using LSTM networks: Overcoming RNN limitations for improved financial forecasting. J. Comput. Sci. Softw. Appl. 2024, 4, 1–7.
- Chavhan, S.; Raj, P.; Raj, P.; Dutta, A.K.; Rodrigues, J.J. Deep learning approaches for stock price prediction: A comparative study of LSTM, RNN, and GRU models. In Proceedings of the 2024 9th International Conference on Smart and Sustainable Technologies (SpliTech); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
- Sun, W.; Mei, J.; Liu, S.; Yuan, C.; Zhao, J. Research on deep learning model for stock prediction by integrating frequency domain and time series features. Sci. Rep. 2025, 15, 30386.
- Hamad, K.H.; Salehi, M.; Barrak, J.I.; Khudhair, A.A.; Al-Refiay, H.A.N. The Relationship Between CEO Power, Labor Productivity, and Company Value in the Iraqi Stock Exchange. Risks 2024, 12, 175.
- Kilic, T.; Varhova, A.; Kirci, P. Stock Market Price Forecasting Using the Arima Model: An Application to Istanbul, Turkiye. J. Econ. Policy Res. 2022, 9, 77–90.
- Kabir, M.R.; Bhadra, D.; Ridoy, M.; Milanova, M. LSTM–Transformer-Based Robust Hybrid Deep Learning Model for Financial Time Series Forecasting. Sci 2025, 7, 7.
- Chhajed, S.; Tripathi, A. Application of Large Language Models in Forecasting Stock Prices. Technical Report. 2024. Available online: https://ssrn.com/abstract=4993835 (accessed on 1 April 2026).
- Wang, M.; Izumi, K.; Sakaji, H. LLMFactor: Extracting profitable factors through prompts for explainable stock movement prediction. arXiv 2024, arXiv:2406.10811.
- Xiao, M.; Jiang, Z.; Qian, L.; Chen, Z.; He, Y.; Xu, Y.; Jiang, Y.; Li, D.; Weng, R.L.; Peng, M.; et al. Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models. arXiv 2025, arXiv:2503.67890.
- Bi, S.; Xiao, J.; Deng, T. The Role of AI in Financial Forecasting: ChatGPT’s Potential and Challenges. In Proceedings of the 4th Asia-Pacific Artificial Intelligence and Big Data Forum; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1064–1070.
- Wu, S.; Xiong, Y.; Cui, Y.; Wu, H.; Chen, C.; Yuan, Y.; Huang, L.; Liu, X.; Kuo, T.W.; Guan, N.; et al. Retrieval-augmented generation for natural language processing: A survey. arXiv 2024, arXiv:2407.13193.
- Yang, S.; Wang, D.; Zheng, H.; Jin, R. TimeRAG: Boosting LLM Time Series Forecasting via Retrieval-Augmented Generation. arXiv 2024, arXiv:2412.16643.
- Tire, K.; Taga, E.O.; Ildiz, M.E.; Oymak, S. Retrieval Augmented Time Series Forecasting. arXiv 2024, arXiv:2411.08249.
- Zhang, B.; Yang, H.; Zhou, T.; Babar, A.; Liu, X.Y. Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language Models. arXiv 2023, arXiv:2310.04027.
- Wawer, M.; Chudziak, J.A. Integrating Traditional Technical Analysis with AI: A Multi-Agent LLM-Based Approach to Stock Market Forecasting. arXiv 2025, arXiv:2506.16813.
- Liu, J.; Yang, L.; Li, H.; Hong, S. Retrieval-augmented diffusion models for time series forecasting. Adv. Neural Inf. Process. Syst. 2024, 37, 2766–2786.
- Wikipedia Contributors. Global Industry Classification Standard - Wikipedia, the Free Encyclopedia. 2025. Available online: https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard (accessed on 2 November 2025).
- Yahoo Finance. Available online: https://finance.yahoo.com/ (accessed on 28 March 2025).
- National Stock Exchange of India - Market Data. Available online: https://www.nseindia.com/ (accessed on 28 March 2025).
- White, J.; Fu, Y.; Chen, Q.; Yuan, S.; Haller, P.; Kühn, E. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv 2023, arXiv:2302.11382.
- Wolf, H. Volatility: Definitions and Consequences; Cambridge University Press: Cambridge, UK, 2005.
- Blitz, D.; Van Vliet, P. The volatility effect: Lower risk without lower return. J. Portf. Manag. 2007, 34, 102–113.
- Ang, A.; Hodrick, R.J.; Xing, Y.; Zhang, X. The cross-section of volatility and expected returns. J. Financ. 2006, 61, 259–299.
- Prucker, P.; Bressem, K.K.; Kim, S.H.; Weller, D.; Kader, A.; Dorfner, F.J.; Ziegelmayer, S.; Graf, M.M.; Lemke, T.; Gassert, F.; et al. Privacy-Preserving Generation of Structured Lymphoma Progression Reports from Cross-sectional Imaging: A Comparative Analysis of Llama 3.3 and Llama 4. J. Imaging Inform. Med. 2025, 1–11.
- Harris, C.R.; Millman, K.J.; Van Der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362.
- McKinney, W. Pandas: A foundational Python library for data analysis and statistics. Python High Perform. Sci. Comput. 2011, 14, 1–9.
- Bisong, E. Matplotlib and seaborn. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners; Springer: Berkeley, CA, USA, 2019; pp. 151–165.
- Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 2020, 17, 261–272.
- Abudy, M.M.; Kaplanski, G.; Mugerman, Y. Market timing with moving average distance: International evidence. J. Int. Financ. Mark. Institutions Money 2024, 97, 102065.
- Xiao, M.; Jiang, Z.; Qian, L.; Chen, Z.; He, Y.; Xu, Y.; Jiang, Y.; Li, D.; Weng, R.L.; Peng, M.; et al. Retrieval-augmented Large Language Models for Financial Time Series Forecasting. arXiv 2025, arXiv:2502.05878.
- Schumpeter, J.A. Capitalism, Socialism and Democracy; Routledge: London, UK, 2013.
- Fama, E.F.; French, K.R. The cross-section of expected stock returns. J. Financ. 1992, 47, 427–465.
- Mishkin, F.S. The Economics of Money, Banking, and Financial Markets; Pearson Education: London, UK, 2007.
- Hamilton, J.D. Oil and the Macroeconomy. In The New Palgrave Dictionary of Economics; Springer: London, UK, 2018; pp. 9753–9759.
- Gorton, G.; Rouwenhorst, K.G. Facts and fantasies about commodity futures. Financ. Anal. J. 2006, 62, 47–68.
- Shiller, R.J. Irrational Exuberance: Revised and Expanded Third Edition; Princeton University Press: Princeton, NJ, USA, 2015.










| Sector | India | USA | South Africa |
|---|---|---|---|
| Communication Services | Bharti Airtel Limited, Bharti Hexacom Limited, Indus Towers Limited, Tata Communications Limited, Tejas Networks Limited | AT&T Inc., Comcast Corporation, Meta Platforms Inc., T-Mobile US Inc., Verizon Communications Inc. | Hudaco Industries Limited, MTN Group Limited, Telkom SA SOC Limited, Vodacom Group Limited |
| Consumer Discretionary | Blue Star Limited, Crompton Greaves Consumer Electricals Limited, Havells India Limited, Titan Company Limited, Whirlpool of India Limited | Amazon.com Inc., Ford Motor Company, Nike Inc., Tesla Inc., The Home Depot Inc. | African and Overseas Enterprises Limited, City Lodge Hotels Limited, Famous Brands Limited, Lewis Group Limited, Tsogo Sun Gaming Limited |
| Consumer Staples | Dabur India Limited, Emami Limited, Godrej Consumer Products Limited, Hindustan Unilever Limited, ITC Limited, Marico Limited, Nestle India Limited | Coca-Cola Company, Costco Wholesale Corporation, PepsiCo Inc., Procter & Gamble Company, Walmart Inc. | AH-Vest Limited, Crookes Brothers Limited, Pepkor Holdings Limited, RCL Foods Limited, Tiger Brands Limited |
| Energy | GAIL (India) Limited, Oil and Natural Gas Corporation Limited, Reliance Industries Limited, Tata Power Company Limited | Chevron Corporation, ConocoPhillips, Exxon Mobil Corporation, NextEra Energy Inc., Schlumberger Limited | Efora Energy Limited, Exxaro Resources Limited, Sasol Limited, Thungela Resources Limited, TotalEnergies Marketing South Africa (Pty) Ltd. |
| Financials | AU Small Finance Bank Limited, Bandhan Bank Limited, HDFC Bank Limited, ICICI Bank Limited, IndusInd Bank Limited, Kotak Mahindra Bank Limited, State Bank of India | Bank of America Corporation, Citigroup Inc., JPMorgan Chase & Co., Morgan Stanley, Wells Fargo & Company | African Dawn Capital Limited, Finbond Group Limited, Investec Limited, Nedbank Group Limited, Standard Bank Group Limited |
| Healthcare | Aurobindo Pharma Limited, Cipla Limited, Dr. Reddy’s Laboratories Limited, Lupin Limited, Sun Pharmaceutical Industries Limited | Abbott Laboratories, Johnson & Johnson, Merck & Co. Inc., Pfizer Inc., UnitedHealth Group Incorporated | Advanced Call Center Technologies, Ascendis Health Limited, Aspen Pharmacare Holdings Limited, Dis-Chem Pharmacies Limited, Netcare Limited |
| Industrials | GMR Infrastructure & GVK Power and Infrastructure Limited, Larsen & Toubro Limited, Siemens Limited, Thermax Limited, Voltas Limited | 3M Company, Boeing Company, Caterpillar Inc., General Electric Company, Honeywell International Inc. | Barloworld Limited, Brikor Limited, Calgro M3 Holdings Limited, Murray & Roberts Holdings Limited, Reunert Limited |
| Information Technology | Infosys Limited, L&T Technology Services Limited, Tata Consultancy Services Limited, Tech Mahindra Limited, Wipro Limited | Alphabet Inc., Apple Inc., International Business Machines Corporation, Microsoft Corporation, NVIDIA Corporation | AYO Technology Solutions Limited, Altron Limited, Datatec Limited, EOH Holdings Limited, Mustek Limited |
| Materials | Hindalco Industries Limited, Hindustan Zinc Limited, JSW Steel Limited, Steel Authority of India Limited, Tata Steel Limited | Alcoa Corporation, Freeport-McMoRan Inc., Nucor Corporation, The Mosaic Company, United States Steel Corporation | Afrimat Limited, Anglo American plc, ArcelorMittal South Africa Limited, Harmony Gold Mining Company Limited, Impala Platinum Holdings Limited |
| Real Estate | Brigade Enterprises Limited, DLF Limited, Godrej Properties Limited, Oberoi Realty Limited, Prestige Estates Projects Limited | AvalonBay Communities Inc., Equity Residential, Prologis Inc., Simon Property Group Inc., Welltower Inc. | Growthpoint Properties Limited, Putprop Limited, Redefine Properties Limited, Resilient REIT Limited, Vukile Property Fund Limited |

| Threshold | GPT-4.0 | LLaMA-4-Scout-17B-16E | Gemini 2.0 Flash |
|---|---|---|---|
| | 53.42% | 45.11% | 46.38% |
| | 56.35% | 48.05% | 38.94% |
| | 49.28% | 40.67% | 31.85% |

| Attribute | Detailed Description |
|---|---|
| Class Level | Defines the investment horizon (1-month or 3-month) and the type of action (buy, sell, hold, or pairwise comparison). |
| Country | The country where the stock is listed and traded (India, United States, or South Africa). |
| Sector 1 | Primary industry sector of Stock 1, categorized using GICS (e.g., Healthcare, Financials, IT). Allows sector-wise performance analysis. |
| Ticker 1 | The official trading symbol of Stock 1 (e.g., HDFCBANK.BSE, ICICIBANK.BSE, INDUSINDBK.BSE, AUBANK.BSE, etc.). Acts as a unique identifier. |
| Stock 1 | The full company name of Stock 1 (e.g., HDFC Bank Limited, ICICI Bank Limited, IndusInd Bank Limited, AU Small Finance Bank Limited, etc.), mapped to its ticker. |
| Sector 2 | Industry sector of Stock 2 (if applicable in pairwise comparisons). Enables cross-sector decision analysis. |
| Ticker 2 | Trading symbol of Stock 2 (in pairwise comparison tasks). |
| Stock 2 | Full company name of Stock 2, used in comparative scenarios. |
| Annual Volatility Stock 1 | Yearly volatility (%) of Stock 1, calculated from daily returns, based on Algorithm 1. Quantifies risk levels for Stock 1. |
| Annual Volatility Stock 2 | Yearly volatility (%) of Stock 2 (in comparison tasks). Assesses risk-return tradeoffs across alternatives. |
| Input | Refers to data source (Yahoo Finance, NSE India, Johannesburg Stock Exchange (JSE)) and period of analysis (March 2022–March 2025). Ensures reproducibility. |
| LLaMA-4-Scout-17B-16E | Categorical recommendation of Meta’s LLaMA-4-Scout-17B-16E model (Buy, Sell, Hold, Stock A, Stock B, None). |
| Gemini 2.0 Flash | Recommendation produced by Google’s Gemini 2.0 Flash model under identical conditions. |
| GPT-4.0 | Recommendation produced by OpenAI’s GPT-4.0 model. Typically the strongest performer, though with higher variability. |
| Actual | Ground-truth investment action derived from realized forward price movement (e.g., Buy if returns are positive). Serves as benchmark. |
| LLaMA-4-Scout-17B-16E Flag | Binary indicator (1/0) showing whether LLaMA-4-Scout-17B-16E’s recommendation matched the actual outcome. Used for accuracy computation. |
| Gemini 2.0 Flash Flag | Binary indicator of Gemini 2.0 Flash’s correctness (1 = match, 0 = mismatch). |
| GPT-4.0 Flag | Binary indicator of GPT-4.0’s correctness compared to the actual decision. |
| Sum | Total number of models (out of 3) that correctly predicted the actual decision. Captures model consensus strength. |
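
The Flag and Sum attributes described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the sample row are hypothetical, and the case-insensitive, whitespace-tolerant label comparison is an assumption that the table does not state.

```python
from typing import Dict

# Model column names as they appear in the attribute table.
MODELS = ("LLaMA-4-Scout-17B-16E", "Gemini 2.0 Flash", "GPT-4.0")


def correctness_flags(recommendations: Dict[str, str], actual: str) -> Dict[str, int]:
    """Return 1 for each model whose label equals the ground truth, else 0.

    Assumption: labels are compared case-insensitively after trimming spaces.
    """
    truth = actual.strip().lower()
    return {m: int(recommendations[m].strip().lower() == truth) for m in MODELS}


def consensus_sum(flags: Dict[str, int]) -> int:
    """Number of models (out of 3) that matched the actual decision."""
    return sum(flags.values())


# Hypothetical row, for illustration only (not taken from the dataset).
row = {"LLaMA-4-Scout-17B-16E": "Buy", "Gemini 2.0 Flash": "Hold", "GPT-4.0": "Buy"}
flags = correctness_flags(row, actual="Buy")
total = consensus_sum(flags)
print(flags, total)
```
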

| LLM | Parameters | Context Window | Multimodality Support |
|---|---|---|---|
| GPT-4.0 | Unknown | 128,000 tokens | Yes |
| Gemini 2.0 Flash | Unknown | 2,000,000 tokens | Yes |
| LLaMA-4-Scout-17B-16E | 17 B | 120,000 tokens | No |

| Horizon | Mean | Median | SD | t (Welch) | p (t) | Shapiro-W | Shapiro-p |
|---|---|---|---|---|---|---|---|
| Short-Term (1-Month) | 44.18% | 39.02% | 16.60% | 0.075 | 0.941 | 0.927 | 0.021 |
| Medium-Term (3-Month) | 43.87% | 40.57% | 17.96% | | | 0.936 | 0.038 |

| Volatility Type | Stock Count | Mean | Median | Min | Max | SD |
|---|---|---|---|---|---|---|
| Low | 86 | 49.50% | 48.15% | 26.19% | 63.54% | 9.49% |
| Medium | 6 | 45.66% | 49.31% | 8.33% | 100.00% | 31.33% |
| High | 58 | 51.11% | 51.31% | 38.33% | 63.33% | 8.69% |

| Country & Volatility | Stock Count | Mean | Median | Min | Max | SD |
|---|---|---|---|---|---|---|
| India—High | 50 | 54.00% | 60.00% | 22.50% | 79.50% | 28.97% |
| US—Low | 50 | 52.00% | 45.50% | 45.00% | 65.50% | 11.69% |
| South Africa—Low | 36 | 47.22% | 46.53% | 43.75% | 51.39% | 3.87% |
| South Africa—Medium | 6 | 43.06% | 37.50% | 37.50% | 54.17% | 9.62% |
| South Africa—High | 8 | 38.54% | 37.50% | 37.50% | 40.62% | 1.80% |

| Sector & Volatility | Mean | Median | Min | Max | SD |
|---|---|---|---|---|---|
| Communication Services—Low | 26.19% | 21.43% | 17.86% | 39.29% | 11.48% |
| Communication Services—Medium | 8.33% | 0.00% | 0.00% | 25.00% | 14.43% |
| Communication Services—High | 45.24% | 50.00% | 17.86% | 67.86% | 25.34% |
| Consumer Discretionary—Low | 56.25% | 46.88% | 43.75% | 78.12% | 19.01% |
| Consumer Discretionary—High | 45.24% | 39.29% | 35.71% | 60.71% | 13.52% |
| Consumer Staples—Low | 63.54% | 59.38% | 59.38% | 71.88% | 7.22% |
| Consumer Staples—Medium | 58.33% | 75.00% | 0.00% | 100.00% | 52.04% |
| Consumer Staples—High | 59.72% | 70.83% | 25.00% | 83.33% | 30.71% |
| Energy—Low | 57.29% | 56.25% | 53.12% | 62.50% | 4.77% |
| Energy—High | 62.50% | 62.50% | 62.50% | 62.50% | 0.00% |
| Financials—Low | 50.00% | 50.00% | 33.33% | 66.67% | 16.67% |
| Financials—Medium | 41.67% | 41.67% | 25.00% | 58.33% | 16.67% |
| Financials—High | 55.56% | 55.56% | 44.44% | 66.67% | 11.11% |
| Healthcare—Low | 61.11% | 61.11% | 55.56% | 66.67% | 5.56% |
| Healthcare—Medium | 45.83% | 45.83% | 41.67% | 50.00% | 4.17% |
| Healthcare—High | 66.67% | 66.67% | 66.67% | 66.67% | 0.00% |
| Industrials—Low | 42.86% | 42.86% | 35.71% | 50.00% | 7.14% |
| Industrials—High | 47.62% | 47.62% | 47.62% | 47.62% | 0.00% |
| Information Technology—Medium | 66.67% | 100.00% | 0.00% | 100.00% | 57.74% |
| Information Technology—High | 38.33% | 35.00% | 30.00% | 50.00% | 10.41% |
| Materials—Low | 42.50% | 37.50% | 35.00% | 55.00% | 10.90% |
| Materials—High | 60.00% | 60.00% | 45.00% | 75.00% | 15.00% |
| Real Estate—Low | 60.42% | 59.38% | 56.25% | 65.62% | 4.77% |
| Real Estate—Medium | 100.00% | 100.00% | 100.00% | 100.00% | 0.00% |
| Real Estate—High | 58.33% | 62.50% | 16.67% | 95.83% | 39.75% |

| LLM | Mean | 95% CI | Median | Min | Max | SD | Q1 | Q3 |
|---|---|---|---|---|---|---|---|---|
| LLaMA-4-Scout-17B-16E | 48.05% | [46.2, 49.8] | 47.00% | 18.42% | 90.00% | 16.30% | 37.59% | 56.50% |
| Gemini 2.0 Flash | 38.94% | [37.1, 40.8] | 38.80% | 7.00% | 80.00% | 13.99% | 27.00% | 47.00% |
| GPT-4.0 | 56.35% | [54.2, 58.5] | 53.00% | 15.15% | 93.00% | 20.63% | 39.15% | 73.00% |

| Volatility Type | LLaMA-4-Scout-17B-16E (%) | Gemini 2.0 Flash (%) | GPT-4.0 (%) |
|---|---|---|---|
| Low | 44.48 [42.9, 46.1] | 47.97 [46.1, 49.8] | 57.56 [55.4, 59.7] |
| Medium | 54.17 [52.3, 56.0] | 37.50 [35.8, 39.2] | 37.50 [35.9, 39.2] |
| High | 56.90 [55.1, 58.7] | 25.00 [23.5, 26.8] | 73.71 [71.5, 75.9] |

| Task | Model | TP | FP | FN | TN |
|---|---|---|---|---|---|
| 1-Month Buy | GPT-4.0 | 39 | 8 | 30 | 73 |
| 1-Month Buy | Gemini 2.0 Flash | 13 | 28 | 56 | 53 |
| 1-Month Buy | LLaMA-4-Scout-17B-16E | 37 | 8 | 32 | 73 |
| 1-Month Buy | Monte Carlo | 2 | 1 | 67 | 80 |
| 1-Month Sell | GPT-4.0 | 42 | 46 | 14 | 48 |
| 1-Month Sell | Gemini 2.0 Flash | 21 | 63 | 35 | 31 |
| 1-Month Sell | LLaMA-4-Scout-17B-16E | 34 | 68 | 22 | 26 |
| 1-Month Sell | Monte Carlo | 9 | 13 | 47 | 81 |
| 3-Month Buy | GPT-4.0 | 40 | 6 | 46 | 58 |
| 3-Month Buy | Gemini 2.0 Flash | 20 | 14 | 64 | 52 |
| 3-Month Buy | LLaMA-4-Scout-17B-16E | 38 | 11 | 44 | 57 |
| 3-Month Buy | Monte Carlo | 26 | 7 | 58 | 59 |
| 3-Month Sell | GPT-4.0 | 22 | 47 | 23 | 58 |
| 3-Month Sell | Gemini 2.0 Flash | 23 | 60 | 22 | 45 |
| 3-Month Sell | LLaMA-4-Scout-17B-16E | 20 | 62 | 25 | 43 |
| 3-Month Sell | Monte Carlo | 29 | 25 | 26 | 70 |

| Task | Model | Precision | Recall | F1 | Balanced Accuracy |
|---|---|---|---|---|---|
| 1-Month Buy | GPT-4.0 | 0.830 | 0.565 | 0.672 | 0.733 |
| 1-Month Buy | Gemini 2.0 Flash | 0.317 | 0.188 | 0.236 | 0.421 |
| 1-Month Buy | LLaMA-4-Scout-17B-16E | 0.822 | 0.536 | 0.649 | 0.719 |
| 1-Month Buy | Monte Carlo | 0.667 | 0.029 | 0.056 | 0.508 |
| 1-Month Sell | GPT-4.0 | 0.477 | 0.750 | 0.583 | 0.630 |
| 1-Month Sell | Gemini 2.0 Flash | 0.250 | 0.375 | 0.300 | 0.352 |
| 1-Month Sell | LLaMA-4-Scout-17B-16E | 0.333 | 0.607 | 0.430 | 0.442 |
| 1-Month Sell | Monte Carlo | 0.409 | 0.161 | 0.231 | 0.511 |
| 3-Month Buy | GPT-4.0 | 0.870 | 0.465 | 0.606 | 0.686 |
| 3-Month Buy | Gemini 2.0 Flash | 0.588 | 0.238 | 0.339 | 0.513 |
| 3-Month Buy | LLaMA-4-Scout-17B-16E | 0.776 | 0.463 | 0.580 | 0.651 |
| 3-Month Buy | Monte Carlo | 0.788 | 0.310 | 0.444 | 0.602 |
| 3-Month Sell | GPT-4.0 | 0.319 | 0.489 | 0.386 | 0.521 |
| 3-Month Sell | Gemini 2.0 Flash | 0.277 | 0.511 | 0.359 | 0.470 |
| 3-Month Sell | LLaMA-4-Scout-17B-16E | 0.244 | 0.444 | 0.315 | 0.427 |
| 3-Month Sell | Monte Carlo | 0.537 | 0.527 | 0.532 | 0.632 |

| Task | Model | Precision (Pos) | Recall (Pos) | Precision (Neg) | Recall (Neg) |
|---|---|---|---|---|---|
| 1-Month Buy | GPT-4.0 | 0.830 | 0.565 | 0.709 | 0.901 |
| 1-Month Buy | Gemini 2.0 Flash | 0.317 | 0.188 | 0.486 | 0.654 |
| 1-Month Buy | LLaMA-4-Scout-17B-16E | 0.822 | 0.536 | 0.695 | 0.901 |
| 1-Month Buy | Monte Carlo | 0.667 | 0.029 | 0.544 | 0.988 |
| 1-Month Sell | GPT-4.0 | 0.477 | 0.750 | 0.774 | 0.511 |
| 1-Month Sell | Gemini 2.0 Flash | 0.250 | 0.375 | 0.470 | 0.330 |
| 1-Month Sell | LLaMA-4-Scout-17B-16E | 0.333 | 0.607 | 0.542 | 0.277 |
| 1-Month Sell | Monte Carlo | 0.409 | 0.161 | 0.633 | 0.862 |
| 3-Month Buy | GPT-4.0 | 0.870 | 0.465 | 0.558 | 0.906 |
| 3-Month Buy | Gemini 2.0 Flash | 0.588 | 0.238 | 0.448 | 0.788 |
| 3-Month Buy | LLaMA-4-Scout-17B-16E | 0.776 | 0.463 | 0.564 | 0.838 |
| 3-Month Buy | Monte Carlo | 0.788 | 0.310 | 0.504 | 0.894 |
| 3-Month Sell | GPT-4.0 | 0.319 | 0.489 | 0.716 | 0.552 |
| 3-Month Sell | Gemini 2.0 Flash | 0.277 | 0.511 | 0.672 | 0.429 |
| 3-Month Sell | LLaMA-4-Scout-17B-16E | 0.244 | 0.444 | 0.632 | 0.410 |
| 3-Month Sell | Monte Carlo | 0.537 | 0.527 | 0.729 | 0.737 |
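
The precision, recall, F1, balanced accuracy, and negative-class figures reported above follow from the TP/FP/FN/TN counts by the standard definitions. The sketch below recomputes them for the GPT-4.0 1-Month Buy row (TP = 39, FP = 8, FN = 30, TN = 73). The class and method names are illustrative, not from the paper, and returning 0.0 for an empty denominator is an assumed convention.

```python
from dataclasses import dataclass


def _ratio(num: int, den: int) -> float:
    """Safe division; 0.0 on an empty denominator is an assumed convention."""
    return num / den if den else 0.0


@dataclass(frozen=True)
class Confusion:
    tp: int
    fp: int
    fn: int
    tn: int

    def precision(self) -> float:  # Precision (Pos)
        return _ratio(self.tp, self.tp + self.fp)

    def recall(self) -> float:  # Recall (Pos), i.e. sensitivity
        return _ratio(self.tp, self.tp + self.fn)

    def precision_neg(self) -> float:  # Precision (Neg)
        return _ratio(self.tn, self.tn + self.fn)

    def recall_neg(self) -> float:  # Recall (Neg), i.e. specificity
        return _ratio(self.tn, self.tn + self.fp)

    def f1(self) -> float:  # harmonic mean of precision and recall
        return _ratio(2 * self.tp, 2 * self.tp + self.fp + self.fn)

    def balanced_accuracy(self) -> float:  # mean of the two class recalls
        return (self.recall() + self.recall_neg()) / 2


# Counts taken from the GPT-4.0 1-Month Buy row of the confusion table.
gpt_1m_buy = Confusion(tp=39, fp=8, fn=30, tn=73)
for name in ("precision", "recall", "f1", "balanced_accuracy",
             "precision_neg", "recall_neg"):
    print(f"{name}: {getattr(gpt_1m_buy, name)():.3f}")
# Printed values: 0.830, 0.565, 0.672, 0.733, 0.709, 0.901
```
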

| Horizon | Decision | Random | Last Return | Moving Avg. | Monte Carlo | GPT-4.0 | LLaMA-4-Scout-17B-16E | Gemini 2.0 Flash |
|---|---|---|---|---|---|---|---|---|
| 1-Month | Buy | 50.00 | 54.00 | 57.33 | 58.67 | 74.67 | 60.00 | 36.67 |
| 1-Month | Sell | 50.00 | 59.33 | 56.67 | 53.33 | 60.00 | 40.00 | 34.67 |
| 3-Month | Buy | 50.00 | 56.00 | 55.33 | 56.00 | 64.00 | 56.67 | 38.00 |
| 3-Month | Sell | 50.00 | 58.67 | 54.67 | 58.67 | 53.33 | 42.00 | 45.33 |

| Model | 3-Month Buy | 3-Month Sell | 1-Month Buy | 1-Month Sell |
|---|---|---|---|---|
| LLaMA-4-Scout-17B-16E | 0.4322 | −0.1664 | 0.3394 | −0.1185 |
| Gemini 2.0 Flash | 0.1565 | −0.1669 | −0.1566 | −0.1191 |
| GPT-4.0 | 0.4111 | −0.1620 | 0.3514 | −0.0723 |
| Monte Carlo | 0.3657 | −0.1656 | 0.0460 | −0.0821 |

| Test | Result |
|---|---|
| Pairwise Wilcoxon Signed-Rank Tests | GPT-4.0 > LLaMA-4-Scout-17B-16E (); GPT-4.0 > Gemini 2.0 Flash (); LLaMA-4-Scout-17B-16E > Gemini 2.0 Flash () |
| Repeated-Measures ANOVA | , |
| Friedman Test | , |
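
For readers unfamiliar with the Friedman test named above, the following is a minimal sketch of the statistic for three models measured on the same evaluation groups. It is not the authors' analysis: the input values are hypothetical, the function names are illustrative, and no tie correction is applied (statistical libraries typically add one). With three models the reference distribution is chi-square with 2 degrees of freedom, whose upper tail probability is exactly exp(-Q/2), so no external library is needed.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_three(blocks: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Friedman Q and p-value for exactly three treatments (no tie correction)."""
    n, k = len(blocks), 3
    if n == 0 or any(len(b) != k for b in blocks):
        raise ValueError("each block must hold exactly three measurements")
    rank_sums = [0.0] * k
    for block in blocks:
        for col, rank in enumerate(average_ranks(block)):
            rank_sums[col] += rank
    q = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    p = math.exp(-q / 2)  # chi-square upper tail with k - 1 = 2 degrees of freedom
    return q, p


# Hypothetical per-group accuracies (columns: three models); not the paper's data.
sample = [
    (0.60, 0.50, 0.40),
    (0.70, 0.50, 0.30),
    (0.55, 0.60, 0.40),
    (0.80, 0.60, 0.50),
]
q_stat, p_value = friedman_three(sample)
print(f"Q = {q_stat:.3f}, p = {p_value:.4f}")
```
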

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Mariani, M.C.; Malakar, S.; Bagchi, A.; Basu, S.; Goswami, S.; Tweneboah, O.K.; Biswas, S.; Dey, A.; Sinha, A. Evaluating the Efficacy of Large Language Models in Stock Market Decision-Making: A Decision-Focused, Price-Only, Multi-Country Analysis Using Historical Price Data. Mach. Learn. Knowl. Extr. 2026, 8, 104. https://doi.org/10.3390/make8040104

