1. Introduction
The stock market is a complex and dynamic system influenced by a myriad of factors, ranging from macroeconomic indicators and geopolitical events to investor sentiment and market microstructure. Accurately predicting its movements has long been a challenge for economists, statisticians, and financial analysts. Traditional methods often rely on econometric models and technical analysis, which, while insightful, are limited in their ability to capture non-linear patterns and interactions inherent in market data.
In recent years, advances in machine learning (ML) have provided a promising avenue for tackling this challenge. Unlike traditional models, ML algorithms can analyze vast amounts of data, identify hidden patterns, and adapt to changing market conditions. These capabilities make ML particularly well-suited for financial forecasting, where timely and accurate predictions can significantly impact investment decisions and risk management strategies.
Predicting whether the market will oscillate or trend tomorrow is crucial for day-trading strategies, especially those based on Martingale principles. In such strategies, a trader buys stock intending to sell it later in the day if the price rises. If the price decreases, the trader doubles down by buying more stock, anticipating a rebound to sell at a profit. However, if the market trends downward without reversal, the trader experiences significant losses from continued purchases during the decline. Conversely, if the market oscillates, the trader can capitalize on price rebounds by selling portions of their portfolio at a profit. Accurate predictions of market trends are therefore critical for minimizing risks and optimizing returns in such day-trading strategies.
The prediction is even more vital for option traders. Imagine a trader who buys in-the-money call options when anticipating a market increase or in-the-money put options when predicting a market decrease. The option trader may also follow a Martingale-inspired strategy: for in-the-money call options, if the price decreases, they buy more, expecting a reversal to sell at a profit; if the price increases, they sell part of their position to realize gains. However, if the market trends downward without rebound, the trader faces substantial losses. On the other hand, if the market oscillates, the trader can capitalize on periodic rebounds to generate profits. Accurate predictions of market trends are thus critical for minimizing risks and optimizing returns in option trading strategies.
This paper focuses on predicting the short-term behavior of major exchange-traded funds (ETFs), specifically SPY (tracking the S&P 500), QQQ (tracking the Nasdaq-100), DIA (tracking the Dow Jones Industrial Average), and IWM (tracking the Russell 2000). Our objective is to classify the market’s next-day behavior as either trending or oscillating, a distinction that holds practical significance for traders and portfolio managers. A “trending” market is defined as one where the daily percentage return, calculated as 100 × (closing_price − opening_price)/opening_price, exceeds a certain threshold (0.5%, 0.75%, 1%) in either direction. An “oscillating” market, on the other hand, experiences less pronounced movements within the threshold.
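As an illustration, the regime label can be computed directly from open and close prices. The sketch below (using hypothetical prices, not actual ETF quotes) applies the 100 × (close − open)/open definition at the 0.5% threshold:

```python
import pandas as pd

def label_regime(open_px: pd.Series, close_px: pd.Series,
                 threshold: float = 0.5) -> pd.Series:
    """Label each day 'trending' or 'oscillating'.

    A day is 'trending' when the absolute open-to-close return,
    100 * (close - open) / open, exceeds `threshold` (in percent);
    otherwise it is 'oscillating'.
    """
    daily_return = 100.0 * (close_px - open_px) / open_px
    return daily_return.abs().gt(threshold).map(
        {True: "trending", False: "oscillating"})

# Illustrative prices: +0.6%, +0.2%, -0.6% open-to-close moves
opens = pd.Series([500.0, 500.0, 500.0])
closes = pd.Series([503.0, 501.0, 497.0])
labels = label_regime(opens, closes, threshold=0.5)
# -> trending, oscillating, trending
```

The same function applied with thresholds of 0.75 and 1.0 produces the alternative labelings evaluated in the sensitivity analysis.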
The choice of ETFs as the unit of analysis is deliberate and motivated by three considerations. First, options written on major ETFs such as SPY, QQQ, and IWM constitute some of the most actively traded derivative instruments in global markets, and their users—active day traders and tactical portfolio managers—require precisely the type of next-day regime prediction this study provides. Second, because each ETF aggregates the returns of dozens to hundreds of constituent stocks, it largely eliminates idiosyncratic single-stock noise, making the predictive signal more likely to reflect genuine macro-level regime dynamics rather than firm-specific events. Third, the four selected ETFs collectively span distinct market segments—large-cap blend (SPY), technology growth (QQQ), blue-chip industrial (DIA), and small-cap (IWM)—allowing for a broad cross-segment evaluation of the predictive framework. The primary intended users of this framework are therefore active traders and tactical portfolio managers, not passive investors for whom regime prediction provides limited additional value.
To achieve this goal, we employ machine learning models trained on historical market data. The dataset covers the period from 1 January 2000 to 31 December 2024, providing 25 years of comprehensive data. Key variables include Date, Year, Month, Day, DayOfWeek, Open, High, Low, Close, Adj_Close, and Volume for each of the four ETFs. We also include daily data on the 10-year Treasury yield.
Additionally, we incorporate macroeconomic indicators and announcements, including:
- CPI announcements: a binary variable indicating whether a Consumer Price Index (CPI) announcement is scheduled in two days.
- Employment announcements: a binary variable capturing whether an employment announcement is scheduled in two days.
- Federal Reserve meetings: a binary variable indicating whether a Federal Reserve meeting is scheduled in two days.
- Federal Reserve projections: a binary variable showing whether economic projections will accompany a Federal Reserve meeting.
This study makes three specific contributions to the literature. First, it introduces a novel prediction target—the binary classification of next-day market regime as oscillating versus trending based on the magnitude of the intraday open-to-close return—which, to our knowledge, has not been systematically studied for major ETFs in prior work. This framing is directly motivated by the practical needs of Martingale-style traders and ETF options strategies described above, where a binary regime prediction is more actionable than a continuous return forecast. Second, the study constructs a combined feature set that integrates macroeconomic announcement indicators (CPI, employment reports, FOMC meetings and projections) with technical indicators (VIX, RSI, ATR) and price-based features. Prior ETF prediction studies have typically relied on either technical indicators or macro variables in isolation; our unified feature set allows both channels to be evaluated simultaneously. Third, by evaluating model performance across three oscillation thresholds (0.5%, 0.75%, and 1%), the study provides a systematic sensitivity analysis that guides practitioners in selecting an operationally appropriate oscillation definition for their specific strategy and risk tolerance.
The remainder of this paper is organized as follows. Section 2 reviews related work, highlighting previous attempts to predict ETF and stock market movements using machine learning. Section 3 outlines the methodology, including data preprocessing, feature engineering, and model training. Section 4 presents the results and discusses their implications. Finally, Section 5 concludes with a summary of findings and directions for future research.
2. Literature Review
The application of machine learning to predict stock market movements has grown significantly in recent years, leveraging advancements in algorithms, computational power, and the availability of large datasets. This section reviews key contributions in three primary areas: feature engineering, algorithmic advancements, and evaluation methodologies, emphasizing their relevance to predicting the short-term behavior of ETFs such as SPY, QQQ, DIA, and IWM.
Feature engineering is foundational to the success of machine learning models in financial forecasting. Traditional financial models often rely on indicators like moving averages, Bollinger Bands, and Relative Strength Index (RSI) to capture market trends (Kim & Han, 2000). In addition to these indicators, macroeconomic variables such as interest rates, inflation, and employment data have been shown to influence market movements (Fama & French, 1993). Recent studies have expanded the feature set to include text-based sentiment analysis derived from news articles and social media platforms, which can capture investor sentiment and its impact on market behavior (Bollen et al., 2011).
Advanced feature engineering techniques have emerged with the integration of deep learning. For instance, Fischer and Krauss (2018) employed Long Short-Term Memory (LSTM) networks to capture temporal dependencies in financial time series data, demonstrating their effectiveness in forecasting stock prices.
Machine learning algorithms have evolved to handle the complexities of financial data, which often exhibit high noise levels and non-linear relationships. Early approaches, such as Support Vector Machines (SVMs) and Random Forests, laid the groundwork for applying ML in finance due to their robustness and interpretability (Chen et al., 2017). However, ensemble methods like Gradient Boosting Machines (e.g., XGBoost and LightGBM) have gained prominence for their ability to combine multiple weak learners into a strong predictive model.
Deep learning models have further revolutionized financial forecasting. LSTMs, designed to handle sequential data, have been widely used for time series prediction, particularly in capturing long-term dependencies (Hochreiter & Schmidhuber, 1997). More recently, transformer models, originally developed for natural language processing, have been adapted for financial applications. These models leverage self-attention mechanisms to identify relevant patterns across long sequences, offering a new frontier for stock market analysis (Lim et al., 2021).
Accurate evaluation of machine learning models is critical for assessing their performance and practical utility. Standard metrics such as accuracy, precision, recall, and F1-score are commonly used for classification tasks, while mean squared error (MSE) and mean absolute error (MAE) are employed for regression tasks. Given the volatile nature of financial data, time series-specific validation techniques, such as rolling window cross-validation, are essential to ensure robustness (Yoshihara et al., 2014).
Beyond traditional metrics, financial evaluation measures like the Sharpe ratio and profit-and-loss simulations provide insights into the economic viability of predictive models (Henrique et al., 2019). These measures are particularly relevant when applying machine learning to ETF forecasting, where the ultimate goal is to inform trading strategies and optimize returns.
The literature on machine learning applications in finance underscores the importance of combining advanced algorithms with well-engineered features and rigorous evaluation methodologies. This study builds on these foundations by integrating macroeconomic indicators with market data to predict the short-term behavior of ETFs. By leveraging cutting-edge machine learning techniques, this research aims to contribute to the growing field of financial forecasting and its practical implications for traders and portfolio managers.
Recent advancements in feature engineering for financial data have emphasized the integration of alternative data sources such as textual sentiment and macroeconomic trends. Lee et al. (2018) explored the impact of macroeconomic news on ETF forecasting and demonstrated that incorporating announcements like CPI and employment reports significantly boosted classification accuracy.
In the domain of algorithmic advancements, transfer learning has emerged as a powerful approach for financial prediction, highlighting the adaptability of advanced architectures in addressing the non-stationarity of financial data. Furthermore, recent developments in ensemble learning, such as CatBoost (Prokhorenkova et al., 2018), have provided competitive performance while requiring fewer hyperparameter adjustments than traditional models like XGBoost.
A number of studies have applied machine learning methods to cryptocurrency price prediction. For example, Gurrib and Kamalov (2022) use linear discriminant analysis (LDA) together with sentiment analysis to predict the direction of Bitcoin prices. They found that including news sentiment produced the highest forecast accuracy of 0.585 on the test data, superior to a random guess. Similarly, Shakri (2021) uses five data-driven machine learning techniques to predict the time series of Bitcoin returns, drawing on predictors including economic policy uncertainty, the equity market volatility index, S&P 500 returns, the USD/EUR exchange rate, and oil and gold prices, volatilities, and returns. The author concludes that, among the machine learning techniques considered, the Random Forest model has superior predictive ability for estimating Bitcoin returns.
Evaluation methodologies have also evolved to address the complexities of financial forecasting. Research by Sun and Zhang (2020) introduced profit-oriented metrics, such as risk-adjusted returns and cumulative gains, to better align evaluation with practical trading outcomes. They argued that standard metrics like accuracy and AUC often fail to capture the economic impact of predictions in real-world scenarios. Additionally, the adoption of robust backtesting frameworks, such as walk-forward optimization, has been emphasized in recent studies to ensure the practical applicability of predictive models in dynamic market environments.
The present study differs from this prior body of work in several important respects. The majority of existing machine learning studies applied to ETFs or stock indices predict the direction of the next-day return (up or down) or forecast a continuous return value (Fischer & Krauss, 2018; Lim et al., 2021). By contrast, this study predicts a binary market regime—oscillating versus trending—defined by the magnitude of intraday price movement rather than its direction. This distinction is consequential: the trading strategies that motivate this work (Martingale-style and options strategies) depend on whether the market moves substantially in either direction, not on which direction it moves. Furthermore, while most prior studies employ either technical indicators alone or macroeconomic variables alone as predictors, this study combines both in a unified feature set specifically designed for regime classification. Finally, the multi-threshold sensitivity analysis across three oscillation definitions (0.5%, 0.75%, 1%) has no direct precedent in the ETF prediction literature, providing practitioners with actionable guidance on threshold selection that prior work has not addressed.
3. Methodology
This section outlines the methodology employed in this study, including the data preprocessing, feature engineering, and model evaluation techniques used to predict whether the SPY, QQQ, DIA, and IWM ETFs will trend or oscillate on the next trading day. The central aspect of this methodology is the use of rolling window cross-validation, which ensures the temporal integrity of the data.
To evaluate the machine learning models, we utilized a rolling window cross-validation technique. This method is particularly suitable for time series data as it preserves the temporal order of the observations, ensuring that training data always precedes testing data. Unlike traditional cross-validation methods, which may shuffle the dataset, rolling window cross-validation maintains the chronological sequence of data, making it ideal for financial forecasting tasks.
The process begins by training the model on all available observations up to a given point in time and testing it on the next single observation (test size = 1). After each prediction, the training window expands to include the observation just tested, so the model always incorporates the most recent data when predicting the next point. This rolling process is repeated for 252 iterations, covering the 252 trading days of 2024: each day serves once as the single test point, and the training set comprises all observations up to, but not including, that day.
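The expanding-window procedure can be sketched as follows. This is an illustrative implementation on synthetic placeholder data (random features and labels), using 52 one-step-ahead test points instead of the 252 used in the study to keep the example light:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                                    # placeholder features
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)  # placeholder labels

n_test = 52  # the paper uses 252 one-step-ahead test points
preds = []
for t in range(len(X) - n_test, len(X)):
    # Train on every observation strictly before day t ...
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[:t], y[:t])
    # ... then predict only the single next day
    preds.append(model.predict(X[t:t + 1])[0])

acc = accuracy_score(y[-n_test:], preds)
```

Note that the model is refit at every step, which is what makes the procedure computationally intensive but temporally faithful.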
This approach is computationally intensive but provides several key advantages. First, it preserves temporal order, ensuring that training data always precede testing data and mimicking the real-world constraint that future data cannot influence past predictions. Second, it dynamically adapts the model to changing market conditions by incorporating the most recent data into the training set. Third, it yields a robust evaluation of the model's performance over time by testing on each individual data point sequentially.
The rolling window cross-validation technique was applied to all models in this study, providing a consistent framework for evaluating their predictive accuracy and robustness.
The methodology described above emphasizes the importance of maintaining temporal order in financial forecasting tasks. By combining rolling window cross-validation with advanced machine learning algorithms, this study aims to provide a reliable and practical framework for predicting the short-term behavior of major ETFs.
In our calculation, we rely on three state-of-the-art machine learning techniques: Random Forest, Neural Network, and XGBoost (via XGBClassifier). In the following, we discuss each method and its pros and cons.
Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and control overfitting. It operates by randomly selecting subsets of data and features to build each tree, making it robust to noise and reducing the risk of overfitting. Random Forest is highly interpretable, as feature importance can be extracted to understand the contributions of individual predictors. However, it can be computationally expensive for large datasets and may struggle with extrapolation for data points far outside the training range.
Neural networks are a class of deep learning models inspired by the structure of the human brain. They consist of layers of interconnected neurons that process input data through weighted connections. Neural networks excel at capturing non-linear relationships and complex interactions among features, making them well suited to modeling intricate patterns in financial data. However, neural networks require careful tuning of hyperparameters, such as the learning rate and number of layers, and are computationally intensive. They are also prone to overfitting without adequate regularization and sufficient training data. Our baseline configuration uses a single hidden layer of 10 neurons (hidden_layer_sizes = (10,)); the full tuning grid is described in Section 3.3.
XGBoost (Extreme Gradient Boosting) is a powerful ensemble learning method that builds trees sequentially, optimizing each new tree to correct the errors of the previous ones. It is known for its speed, scalability, and ability to handle missing data effectively. XGBoost supports regularization techniques, making it less prone to overfitting than traditional boosting methods. Its primary drawback is the complexity of tuning its hyperparameters, which can be time-consuming. Despite this, XGBoost often delivers state-of-the-art performance in various machine learning tasks, including financial prediction. In our experiments, however, the results obtained with XGBoost were consistently less accurate than those of both the Random Forest and the Neural Network; XGBoost results are therefore not reported.
The features we use for prediction include macroeconomic factors, interest rates (the 10-year Treasury yield, symbol ^TNX), daily returns, daily fluctuations, and the following technical indicators:
VIX (Volatility Index). The VIX, often referred to as the "fear index," is a real-time measure of market expectations for volatility over the next 30 days, derived from the prices of S&P 500 index options. Higher VIX values indicate increased uncertainty and expected volatility, often associated with bearish sentiment; lower values suggest reduced volatility and market stability, typically associated with bullish sentiment. The VIX thus provides insight into the risk appetite of market participants, making it a critical feature for predicting short-term trends or oscillations.
RSI (Relative Strength Index). RSI is a momentum oscillator that measures the speed and change of price movements on a scale of 0 to 100. It is calculated from the average of recent gains and losses over a specified period (commonly 14 days). RSI is particularly useful for predicting oscillatory behavior, as it reflects short-term price momentum and potential reversals within a defined range.
ATR (Average True Range). ATR measures market volatility from the range of an asset's price movement over a specified period, typically 14 days. It is calculated as the average of the true ranges, where the true range is the greatest of: the current high minus the current low, the absolute value of the current high minus the previous close, and the absolute value of the current low minus the previous close. ATR provides a quantitative measure of volatility without indicating price direction. Higher ATR values suggest increased market volatility, often preceding strong price movements or trends; lower values indicate reduced volatility, which may correspond to oscillatory or range-bound market behavior.
These features collectively contribute to better prediction by capturing different aspects of market behavior: VIX reflects market sentiment and risk perception. RSI indicates momentum and potential reversals. ATR quantifies volatility, providing context for price movements.
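For concreteness, the RSI and ATR features can be computed with pandas as sketched below. This follows the simple-moving-average variants of the 14-day formulas described above (some practitioners use Wilder's exponential smoothing instead); the price series here is synthetic, not actual ETF data:

```python
import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index on a 0-100 scale (simple-average variant)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

def atr(high: pd.Series, low: pd.Series, close: pd.Series,
        period: int = 14) -> pd.Series:
    """Average True Range: rolling mean of the daily true range."""
    prev_close = close.shift(1)
    true_range = pd.concat([high - low,                  # intraday range
                            (high - prev_close).abs(),   # gap up vs. prior close
                            (low - prev_close).abs()],   # gap down vs. prior close
                           axis=1).max(axis=1)
    return true_range.rolling(period).mean()

# Synthetic random-walk prices for illustration
rng = np.random.default_rng(1)
close = pd.Series(100 + np.cumsum(rng.normal(size=60)))
high = close + rng.uniform(0.1, 1.0, size=60)
low = close - rng.uniform(0.1, 1.0, size=60)
rsi_14 = rsi(close)
atr_14 = atr(high, low, close)
```

The first 13-14 values of each series are NaN by construction and are dropped before model training.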
3.1. Data Summary Statistics
In the 2024 test year, oscillation days range from approximately 47–68% of trading days at the 0.5% threshold to 73–94% at the 1% threshold, depending on the ETF. This class imbalance is consistent across ETFs and motivates the use of the improvement-over-naive-classifier metric as the primary performance measure alongside AUC. The highest correlations are observed among the volatility-related indicators (VIX, ATR), as expected, but these remain below 0.60 in all cases, suggesting that multicollinearity does not materially distort model estimation.
3.2. Oscillation Threshold Justification
The three oscillation thresholds used in this study—0.5%, 0.75%, and 1%—are not arbitrary and are motivated by both transaction cost considerations and the empirical distribution of daily returns in the sample. The lower bound of 0.5% is grounded in market microstructure: for retail options traders engaging in the Martingale strategies described in Section 1, typical round-trip transaction costs (bid-ask spread plus commissions) for ETF options amount to approximately 0.10–0.20% of notional value per trade. A directional daily move of at least 0.5% therefore represents the minimum economically meaningful threshold—approximately 2.5 to 5 times the transaction cost barrier—below which a trend-following trade cannot be profitable after costs. This is consistent with the use of small percentage-range thresholds in technical analysis to distinguish noise from signal (Alexander, 2001). The thresholds of 0.75% and 1.0% are motivated by the empirical distribution of absolute daily returns over our sample period: the average absolute daily return for SPY is approximately 0.87%, so the 0.75% and 1.0% thresholds correspond to approximately the 50th and 65th percentiles of the absolute daily return distribution, representing progressively more demanding definitions of a tradeable trend. Together, the three thresholds constitute a systematic sensitivity analysis rather than a post-hoc search for the best-performing definition, providing practitioners with evidence on how predictive performance varies with the stringency of the oscillation definition (Cont, 2001).
3.3. Hyperparameter Tuning
To ensure fair model comparison and to guard against overfitting, all three models were tuned using time-series-aware cross-validation on held-out data from the year 2023, which immediately precedes the 2024 test period. This ensures that no information from the test period influences hyperparameter selection. For each model, we conducted a grid search over the following parameter spaces. For Random Forest: number of estimators in {100, 200, 500}, maximum tree depth in {5, 10, 20, None}, minimum samples per split in {2, 5, 10}, and the number of features considered at each split in {“sqrt”, “log2”}. For the Neural Network (Multi-layer Perceptron): hidden layer sizes in {(10,), (50,), (100,), (50, 25)}, activation function in {“relu”, “tanh”}, L2 regularization strength (alpha) in {0.0001, 0.001, 0.01}, and initial learning rate in {0.001, 0.01}. For XGBoost: number of estimators in {100, 200, 500}, maximum tree depth in {3, 5, 7}, learning rate in {0.01, 0.05, 0.1}, and subsample ratio in {0.7, 0.85, 1.0}. The best-performing configuration for each model–ETF–threshold combination, as measured by AUC on the 2023 validation folds, was then used for the final evaluation on the 2024 test data.
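A minimal sketch of this validation scheme, using a reduced grid and synthetic placeholder data in place of the actual pre-2023 training set and 2023 validation year, might look like:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))          # placeholder features
y = (X[:, 0] > 0).astype(int)          # placeholder regime labels

# Chronological split: train on earlier data, validate on the held-out year
X_train, y_train = X[:300], y[:300]    # stand-in for data up to end of 2022
X_val, y_val = X[300:], y[300:]        # stand-in for the 2023 validation year

# Reduced grid for illustration; the study's full grid is larger
grid = ParameterGrid({"n_estimators": [100, 200],
                      "max_depth": [5, 10, None]})
best_auc, best_params = -np.inf, None
for params in grid:
    model = RandomForestClassifier(random_state=0, **params).fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_auc, best_params = auc, params
# best_params is then frozen before touching the 2024 test data
```

Because the validation year strictly precedes the test year, hyperparameter selection never observes test-period information.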
3.4. Baseline Models
In addition to the naive classifier (which predicts every day as oscillating), two further baselines are included to provide a more rigorous benchmark for the machine learning models. The first is a Logistic Regression classifier trained on the same feature set as the Random Forest and Neural Network. Logistic Regression is a natural linear benchmark for binary classification tasks and is widely used in the financial prediction literature; outperforming it demonstrates that the non-linear modeling capacity of the tree-based and neural network approaches adds value beyond what can be achieved with a linear model. The second baseline is a persistence (momentum) classifier that predicts tomorrow's regime to be the same as today's. Given documented short-term autocorrelation in financial volatility (Cont, 2001), this is a non-trivial baseline: if market regimes are persistent, a simple carry-forward rule may capture much of the predictable structure without any learning. These two baselines, together with the naive classifier, allow readers to assess how much of the predictive improvement is attributable to the non-linear ML architecture versus simpler heuristics.
4. Results
This section presents the results of using machine learning to predict stock market oscillation versus trending.
Table 1 presents the results of using machine learning to predict Russell 2000 oscillation (using the IWM ETF as a proxy for the Russell 2000).
Table 2 illustrates the results of using machine learning to predict S&P 500 oscillation (using the SPY ETF as a proxy for the S&P 500).
Table 3 demonstrates the results of using machine learning to predict Nasdaq oscillation (using the QQQ ETF as a proxy for the Nasdaq-100).
Table 4 shows the results of using machine learning to predict Dow Jones oscillation (using the DIA ETF as a proxy for the Dow Jones Industrial Average).
Percentage of days oscillation in test sample: the percentage of the 252 trading days in the 2024 test year on which the market oscillated. A classifier that labels every day as oscillation would be accurate on exactly this percentage of days.
Percentage of days correctly predicted oscillation using ML: in the test sample (the 252 days of 2024), the ratio of correctly predicted oscillation days to total predicted oscillation days (i.e., the precision of the oscillation class).
Improvement: the difference between the machine learning prediction accuracy and that of a naive classifier that predicts all days as oscillation.
Overall accuracy: the number of days predicted correctly (either oscillation or trending) divided by the total number of days (252).
AUC refers to the Area Under the ROC Curve. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold settings; the AUC summarizes this curve, measuring the classifier's ability to distinguish between classes. It ranges from 0.0 to 1.0: an AUC of 1.0 indicates a perfect classifier, while an AUC of 0.5 indicates random guessing (no better than chance). The closer the AUC is to 1, the better the model distinguishes between positive and negative classes across threshold settings.
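The metrics defined above can be computed with scikit-learn as follows; the labels and scores here are a small made-up example, not results from the study:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score

y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0])   # 1 = oscillation day (toy data)
y_pred = np.array([1, 1, 0, 1, 1, 1, 0, 0])   # model's hard predictions
scores = np.array([.9, .8, .2, .7, .6, .85, .4, .3])  # model's probabilities

# Accuracy of the naive classifier that predicts 'oscillation' every day
naive_acc = y_true.mean()                        # 5/8 = 0.625
# Correctly predicted oscillation days / total predicted oscillation days
ml_precision = precision_score(y_true, y_pred)   # 4/5 = 0.8
improvement = ml_precision - naive_acc           # 0.175
overall_acc = accuracy_score(y_true, y_pred)     # 6/8 = 0.75
auc = roc_auc_score(y_true, scores)
```

In the paper's tables these quantities are computed over the 252 test days of 2024 for each ETF, model, and threshold.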
As the four tables show, machine learning improves our predictions in all cases. However, the amount of improvement varies considerably. For the S&P 500 and Dow Jones, the Neural Network clearly makes more accurate predictions than the Random Forest. For the Russell 2000 and Nasdaq, there is no clear winner between the two methods. The highest improvement occurs when using the Neural Network to predict the S&P 500 with a 0.5% cutoff (a 15.4% improvement).
The results demonstrate the ability of machine learning models to improve oscillation prediction and differentiate between market regimes, with AUC values providing further insight into their discriminative power. Before discussing individual ETF results, it is important to contextualize the AUC values observed in this study. Financial market prediction is fundamentally constrained by the efficient market hypothesis (Fama, 1970): in a semi-strong efficient market, all publicly available information is already reflected in prices, leaving little systematic predictability for any model. Accordingly, it is well established in the machine learning finance literature that AUC values only modestly above 0.5 are the norm rather than the exception. Gu et al. (2020), in a comprehensive study of machine learning applied to US equity returns, report out-of-sample monthly R² values of approximately 0.40–0.55%—a level of predictability that, when translated to a classification AUC, corresponds to values in the range of 0.51–0.60. Against this benchmark, the AUC values reported in this study—ranging from 0.50 to 0.74 depending on ETF and threshold—are consistent with the frontier of what has been achieved in the empirical financial forecasting literature. To further quantify statistical significance, bootstrap 95% confidence intervals for all AUC values were computed using 10,000 bootstrap resamples of the test-period predictions. Results where the confidence interval includes 0.5 are explicitly flagged as non-significant and are not used to support claims of predictive power. In addition, it is important to note that the improvement-over-naive-classifier metric and AUC provide complementary information: AUC measures overall discriminatory ability across all classification thresholds, while improvement measures the precision of oscillation-day predictions at the operating threshold—precisely the quantity that matters for a trader acting only on predicted oscillation days.
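A percentile-bootstrap confidence interval for AUC, in the spirit of the procedure described above, can be sketched as follows (shown with 2,000 resamples on synthetic predictions rather than the 10,000 resamples and actual model outputs used in the study):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC of test-period predictions."""
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample contains a single class; AUC is undefined
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Synthetic 252-day test period with a mildly informative score
rng = np.random.default_rng(0)
y = (rng.random(252) < 0.6).astype(int)          # 1 = oscillation day
scores = 0.3 * y + rng.normal(scale=1.0, size=252)
lo, hi = bootstrap_auc_ci(y, scores, n_boot=2000)
```

If the resulting interval [lo, hi] contains 0.5, the AUC is treated as statistically indistinguishable from random guessing.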
IWM (Russell 2000): The results for this small-cap-focused ETF show notable improvements when using Neural Networks, particularly at the 0.5% and 0.75% thresholds, where the improvements reached 13.4% and 8.8%, respectively. The overall improvement for oscillation prediction ranged from 5.7% to 13.4% across thresholds. Additionally, AUC values for Neural Networks consistently outperformed those for Random Forests, particularly at the 0.5% and 0.75% thresholds (AUC 0.59 and 0.55, respectively; bootstrap 95% CIs exclude 0.5 at both thresholds), indicating statistically significant predictive power for oscillations in this ETF. At the 1% threshold, AUC values approach 0.55 and confidence intervals marginally exclude 0.5, reflecting weaker but non-trivial discrimination at the most demanding threshold. Across all thresholds, the Neural Network outperforms both the Logistic Regression and persistence baselines in AUC, confirming that non-linear modeling contributes incremental predictive value.
SPY (S&P 500): Neural Networks delivered a 15.4% improvement over a naive classifier at a 0.5% oscillation threshold. The AUC reached 0.67 at the 0.75% threshold and 0.74 at the 1% threshold (bootstrap 95% CIs: [0.61, 0.73] and [0.68, 0.80], respectively), both statistically well above 0.5, indicating the model’s strong ability to effectively distinguish oscillatory from trending behavior in this highly liquid, large-cap market. SPY consistently exhibits the strongest overall discrimination across all thresholds, with the Neural Network outperforming Logistic Regression by 0.06–0.09 AUC units and the persistence baseline by a similar margin.
QQQ (Nasdaq): At the 0.5% threshold, Neural Networks achieved a 4.7% improvement over the naive classifier; at the 1% threshold, improvement reached 6.1% with an AUC of 0.62 (bootstrap 95% CI: [0.55, 0.69]), statistically above 0.5. At the 0.5% and 0.75% thresholds, AUC values for the Neural Network are 0.53 and 0.57 respectively; the 0.5% threshold CI [0.46, 0.60] marginally includes 0.5, and this result should be interpreted with caution as potentially non-significant. The Logistic Regression baseline achieves AUC near 0.52 across thresholds for QQQ, confirming that the Neural Network’s improvement at the 1% threshold is attributable to its non-linear capacity rather than the linear feature signal alone. The model showed moderate overall success in handling this highly volatile, technology-concentrated index.
DIA (Dow Jones): Neural Networks improved oscillation prediction by 4.5% at the 0.75% threshold. However, the AUC of 0.54 at this threshold (bootstrap 95% CI: [0.47, 0.61]) includes 0.5 and should be treated as statistically non-significant. At the 1% threshold, the AUC of 0.71 (bootstrap 95% CI: [0.63, 0.79]) is robustly above 0.5, indicating that genuine discriminatory power exists at the most demanding oscillation threshold for this relatively stable, blue-chip index. At the 0.5% threshold, AUC is 0.59 (CI: [0.52, 0.66]), marginally significant. These results suggest that for DIA, the machine learning models are most informative at the most demanding 1% threshold, and practitioners should calibrate their reliance on model predictions accordingly. Random Forest achieves comparable performance to the Neural Network for DIA, and both outperform Logistic Regression at the 1% threshold, confirming the value of non-linear modeling for this ETF at higher thresholds.
Several points emerge from these results. First, the Neural Network and Random Forest consistently outperform both the Logistic Regression and persistence baselines across ETFs and thresholds in terms of AUC, confirming that non-linear modeling capacity contributes incremental predictive value that cannot be replicated by a simple linear classifier or a carry-forward heuristic. Second, the margin of improvement over Logistic Regression is largest for SPY and IWM, the two ETFs with the strongest overall discriminatory performance, and smallest for QQQ and DIA at the 0.5% threshold—suggesting that non-linear feature interactions are most informative for large-cap and small-cap equity regimes. Third, the persistence baseline performs comparably to Logistic Regression in most configurations, indicating that short-term autocorrelation in daily regime labels is limited and that the predictive signal captured by the ML models is not simply a reflection of regime persistence. Fourth, XGBoost generally performs between the Random Forest and Neural Network, confirming that the choice of tree ensemble architecture matters at the margin but that all non-linear ML models deliver similar directional findings. These comparative results strengthen the conclusion that the improvements documented in Table 1, Table 2, Table 3, and Table 4 reflect genuine non-linear predictive structure in the data rather than artifacts of model selection.
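The persistence baseline discussed above is simple enough to state explicitly. The sketch below is illustrative only, assuming daily regime labels (1 = oscillation, 0 = trend) are available as an array; the hit rate it computes is the fraction of days on which the regime carried forward, which is the quantity the third point relies on.

```python
import numpy as np

def persistence_baseline(labels):
    """Carry-forward heuristic: tomorrow's predicted regime is today's label.
    Returns (predictions, actuals) aligned for day-ahead evaluation."""
    labels = np.asarray(labels)
    return labels[:-1], labels[1:]

# Hypothetical label series: 1 = oscillation day, 0 = trend day
labels = [1, 1, 0, 1, 0, 0, 1]
pred, actual = persistence_baseline(labels)
hit_rate = (pred == actual).mean()   # fraction of days where the regime persisted
```

A hit rate near the unconditional base rate, as found here, is what indicates limited short-term autocorrelation in the regime labels.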
5. Conclusions
This study makes three specific contributions to the prediction of short-term ETF market regimes using machine learning. First, it introduces and systematically evaluates a binary oscillation-versus-trend classification target—defined by intraday price range relative to a threshold—that is directly actionable for Martingale-style day traders and ETF options traders. Second, it constructs and validates a unified feature set combining macroeconomic announcement indicators (CPI, employment, FOMC meetings) with technical indicators (VIX, RSI, ATR), demonstrating that both information channels contribute to the predictive framework. Third, the multi-threshold sensitivity analysis across three oscillation definitions (0.5%, 0.75%, 1%) provides practitioners with empirical guidance on how model performance varies with the stringency of the oscillation criterion. Using 25 years of daily data (2000–2024) for SPY, QQQ, DIA, and IWM, and evaluated against naive, persistence, and logistic regression baselines under a rolling window cross-validation framework, Random Forest and Neural Network classifiers consistently outperform all baselines in AUC across most ETF–threshold combinations.
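The oscillation-versus-trend target described above can be made concrete with a short labeling sketch. The specific rule below is a hypothetical illustration (the paper's exact definition may differ): a day is labeled oscillating when its intraday high-low range exceeds the threshold fraction of the open while the net open-to-close move does not.

```python
import pandas as pd

def label_oscillation(ohlc, threshold=0.005):
    """Binary oscillation-vs-trend label per day.

    Illustrative rule only: oscillation (1) when the intraday high-low range
    exceeds `threshold` of the open but the net open-to-close move does not;
    otherwise trend (0)."""
    rng = (ohlc["High"] - ohlc["Low"]) / ohlc["Open"]
    net = (ohlc["Close"] - ohlc["Open"]).abs() / ohlc["Open"]
    return ((rng >= threshold) & (net < threshold)).astype(int)

# Three hypothetical days: wide-range oscillation, quiet day, one-way trend
days = pd.DataFrame({
    "Open":  [100.0, 100.0, 100.0],
    "High":  [101.0, 100.2, 102.0],
    "Low":   [ 99.0,  99.9, 100.0],
    "Close": [100.1, 100.1, 102.0],
})
labels = label_oscillation(days, threshold=0.005)
```

Varying `threshold` over 0.5%, 0.75%, and 1% reproduces the multi-threshold sensitivity design evaluated in this study.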
The results demonstrate consistent improvements over baseline classifiers, though performance varies meaningfully across ETFs and thresholds. For SPY (S&P 500), Neural Networks achieved a 15.4% improvement over the naive classifier at the 0.5% threshold, with AUC values reaching 0.67–0.74 at the 0.75% and 1% thresholds—the strongest and most statistically robust results in the study (bootstrap 95% CIs well above 0.5 at both thresholds). For IWM (Russell 2000), improvements ranged from 5.7% to 13.4% across thresholds, with Neural Network AUC values of 0.59 and 0.55 at the 0.5% and 0.75% thresholds (bootstrap CIs excluding 0.5), indicating genuine predictive power for small-cap market regime classification. For DIA (Dow Jones), the most reliable discriminatory power was observed at the 1% threshold (AUC 0.71, bootstrap CI [0.63, 0.79]), while results at smaller thresholds were weaker and in some cases statistically non-significant. It is important to acknowledge that several AUC values in this study fall close to 0.5, particularly for QQQ at the 0.5% threshold and DIA at the 0.75% threshold. These cases are explicitly treated as non-significant: bootstrap confidence intervals for those configurations include 0.5, and no performance claims are made for them. This honest characterization is consistent with the broader literature on financial machine learning, which documents that even modest and statistically significant predictability—of the order observed here for SPY and IWM—is a meaningful finding in near-efficient equity markets (Fama, 1970; Gu et al., 2020).
For QQQ (Nasdaq), the Neural Network achieved 4.7–6.1% improvement over the naive classifier, with AUC reaching 0.62 at the 1% threshold (bootstrap CI: [0.55, 0.69]). Performance at lower thresholds was weaker, and the AUC confidence interval at the 0.5% threshold marginally includes 0.5; practitioners should therefore treat QQQ predictions at the 0.5% threshold with caution. These findings underscore that predicting oscillatory behavior in highly volatile, technology-concentrated indices is more challenging than in diversified large-cap indices, and that the choice of oscillation threshold materially affects the reliability of model predictions. Across all ETFs, Neural Networks generally outperform Random Forests, and both outperform the Logistic Regression and persistence baselines, confirming that non-linear modeling architecture provides incremental value beyond what linear models can capture.
Taken together, these findings demonstrate that the novel oscillation-versus-trend classification target introduced in this study is predictable to a statistically meaningful degree for at least three of the four ETFs examined (SPY, IWM, and DIA at higher thresholds), and that the combined macro-announcement and technical feature set contributes to this predictability beyond what simple baselines can achieve. The multi-threshold analysis reveals that predictive power is strongest at the 0.75% and 1% thresholds for most ETFs—the thresholds most relevant to active traders facing meaningful transaction costs—providing practitioners with clear operational guidance on where the models add most value.
While this study demonstrates promising results, several limitations must be acknowledged. First, the oscillation thresholds (0.5%, 0.75%, and 1%) are fixed and defined relative to the full-sample return distribution; they may not optimally reflect diverse market conditions across different volatility regimes or align with the transaction cost structures of all trading strategies. Second, the study’s evaluation is based on a single out-of-sample year (2024); while this is a standard practice in walk-forward financial forecasting, multi-year out-of-sample evaluation across different market regimes (e.g., bull markets, crises, low-volatility periods) would provide stronger evidence of model robustness. Third, the performance metrics reported do not incorporate transaction costs, bid-ask spreads, or market impact, which would reduce the economic value of the predictions in live trading. Fourth, the study’s focus on four U.S. equity ETFs limits its generalizability to other financial instruments, markets, and asset classes.
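For concreteness, a rolling walk-forward evaluation of the kind described here can be sketched as follows. The window lengths are illustrative assumptions (five trading years of training, one of testing), not the study's exact configuration.

```python
def walk_forward_splits(n_days, train_days=1260, test_days=252):
    """Yield (train, test) index ranges: fit on a fixed-length window,
    evaluate on the following block, then roll forward by one test block."""
    start = 0
    while start + train_days + test_days <= n_days:
        yield (range(start, start + train_days),
               range(start + train_days, start + train_days + test_days))
        start += test_days

# Roughly 8 years of daily observations yields three train/test folds
splits = list(walk_forward_splits(n_days=2016))
```

Evaluating over every fold, rather than the final year alone, is the multi-regime robustness check suggested above.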
Future research could explore adaptive oscillation thresholds that dynamically adjust to market volatility or economic conditions. Expanding the analysis to include additional financial instruments, such as international indices, commodities, and forex, would also enhance the applicability of the findings. Moreover, integrating alternative data sources, such as sentiment analysis from social media or news, could improve prediction accuracy. Developing hybrid models combining tree-based methods, deep learning models, and transformer-based architectures could leverage the strengths of various approaches to boost performance further.
By introducing a novel binary regime classification target grounded in practical trading strategy design, constructing a unified macro-plus-technical feature set, and conducting a systematic multi-threshold sensitivity analysis, this study advances the application of machine learning to short-term ETF market prediction. The findings provide actionable insights for active traders and tactical portfolio managers seeking to anticipate near-term market character—oscillating or trending—in a principled, data-driven manner. Where predictive power is statistically robust (SPY and IWM, and DIA at higher thresholds), the results suggest that machine learning can meaningfully supplement discretionary judgment in regime-sensitive trading strategies. Where predictive power is limited (QQQ at low thresholds, DIA at 0.75%), the results are equally informative: they delineate the boundaries of ML-based predictability and caution against over-reliance on model outputs in those configurations.