1. Introduction
Numerous analysis and modeling techniques have been developed for time series data [1]. The earliest approaches include statistical methods such as the Box–Jenkins method, developed by George Box and Gwilym Jenkins in the 1970s [2], where analyses were performed on relatively small data sets. Since then, advances in computing technologies, data storage, and electronic data acquisition have enabled the collection and processing of large-scale time series data. Modern approaches to time series modeling include a wide range of machine learning and deep learning methods [3], with examples such as Gramian angular fields (GAF) and Markov transition fields (MTF) [4], heuristic methods [5], and ensemble learning [6].
Financial time series (FTS) is a subset of the time series domain and represents the observed value of a financial asset over time. The analysis of financial markets plays a significant role in economic decision-making and influences the behavior of individuals and institutions. Due to the importance of FTS, numerous modeling techniques have been developed to predict movements within financial markets [
7]. However, the irregular, noisy, and non-stationary characteristics of FTS make modeling a complex and error-prone task [
7].
A key challenge in FTS modeling is that no single modeling approach consistently outperforms others across different data sets, as suggested by the no free lunch theorem (NFLT) [
8]. Each modeling approach is problem-specific, and performance varies depending on the characteristics of the underlying data. Ensemble learning seeks to address this challenge by improving the generalization capabilities of individual models through the combination of multiple learners [
9]. Ensemble learning is considered a meta-approach to machine learning [
10], where the focus is on the process of learning rather than the performance of a single model [
11]. Ensemble models can be constructed using a variety of machine learning algorithms and are typically developed using bagging, boosting, or stacking strategies [
10].
Existing research on ensemble modeling for FTS has primarily focused on isolated components of the ensemble design process. These components include improvements in base models [
12,
13,
14,
15], feature engineering [
16,
17,
18], and architectural variations [
19,
20,
21]. However, the literature does not provide a comprehensive investigation into the end-to-end design of bagging ensemble models for FTS classification problems [
22]. Current studies seldom examine how multiple ensemble components, which include bootstrap methods, ensemble sizes, voting mechanisms, and loss functions, interact to influence predictive performance. Furthermore, prior work rarely evaluates ensemble design choices across a diverse set of baseline models that span econometric, machine learning, and deep learning domains.
This study aims to address these gaps by treating ensemble design as a multi-dimensional modeling problem for FTS classification tasks. The contributions of this study include: (i) an end-to-end evaluation framework for bagging ensemble models applied to FTS classification tasks; (ii) the introduction of a sub-sampling bootstrap method tailored for FTS and its comparison with established time series bootstrap techniques; (iii) the development of an optimized ensemble weighting mechanism using particle swarm optimization (PSO) and quantum-inspired particle swarm optimization (QPSO) algorithms; and (iv) a comparative analysis across baseline models from econometrics (autoregressive integrated moving average (ARIMA)), machine learning (decision tree (DT), logistic regression (LR), multi-layer perceptron (MLP)), and deep learning (long short-term memory network (LSTM)), thereby providing a cross-domain evaluation of ensemble performance. The study also includes a generalizable ensemble framework that can be applied to any classifier and remains interpretable.
Bagging ensemble models are developed and evaluated on six FTS data sets, which are transformed into a classification problem where the objective is to predict the direction of movement mapped to a buy, sell, or do-nothing action. This formulation aligns with practical financial decision-making, where directional movements correspond directly to actionable trading outcomes [
23]. Although transformer-based and generalized autoregressive conditional heteroskedasticity (GARCH)-type models have shown promise in recent FTS research, they are not included in this study due to the focus on classification-oriented ensemble design rather than sequence-to-sequence forecasting or volatility modeling [
24,
25]. The main objectives of this study are to analyze the performance of time series bootstrap methods, evaluate optimized voting mechanisms, assess the impact of ensemble sizes, and examine the influence of loss functions used in metaheuristic optimization. A preliminary objective is to perform a detailed data analysis of the six FTS data sets to inform the development of an FTS classification problem. The insights obtained from this analysis guide the ensemble model development process.
The paper is structured as follows: a review of related works, a description of the empirical process, the design and development of the ensemble models, a presentation of ensemble modeling results, and a statistical comparison of the best ensemble configuration with the baseline models. The paper concludes with final remarks and proposed directions for future work.
2. Related Works
This section reviews the literature relevant to FTS modeling and analysis. It presents foundational concepts, which include formal definitions of time series and FTS, as well as key analytical properties such as stationarity, entropy, and commonly used financial indicators. The review also outlines methodological components that appear frequently in prior work, which include ensemble methods, bootstrap procedures for dependent data, and metaheuristic optimization strategies. These elements establish the conceptual and methodological background for a comparative analysis that concludes this section.
2.1. Time Series
A time series is a sequence $X$ of observations recorded over a period of time. The observations may occur across a continuous interval, at regular sample intervals, or at fixed time points. A time series $X$ is defined as $X = \{x_i : i \in T\}$, where each $x_i \in \mathbb{R}$ and $i \in T$, with $T$ representing the index set. An FTS follows the same structure, with each $x_i$ interpreted as a financial observation within the sequence.
2.2. Stationarity
Stationarity is a property of a time series that reflects whether its statistical characteristics, such as the mean, variance, and covariance, remain constant over time. A time series is stationary when the process that generates the observations does not depend on time [26]. In such cases, no trends or seasonality appear in the data. A non-stationary series displays time-dependent behavior, with changes in its statistical properties and visible features such as trends or seasonality. An analysis of a non-stationary series must account for this time dependence [26]. Stationarity reduces the complexity of predictive model development, and Table 1 outlines the main forms of stationarity. An FTS benefits from stationarity in the same way, as stable statistical properties simplify model construction. Two approaches assist in determining whether a series arises from a stationary process, namely visual inspection and statistical tests. Visual inspection provides a judgment-based assessment of the data. The statistical tests used in this study are the augmented Dickey–Fuller (ADF) test [27] and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test [28]. Both tests rely on the presence or absence of a unit root, a feature of a stochastic process [28], and are often applied together to confirm the stationarity properties of a time series.
2.3. Entropy
Entropy is a foundational concept in information theory, introduced by Shannon (1948) [29], and serves as a measure of uncertainty in random variables. In this context, a variable represents a unit of information storage, expressed in bits. Within time series and FTS analysis, entropy provides insight into the level of volatility in a series by distinguishing between common and rare events reflected in the data. The probability of an event determines the amount of information required to represent that event. To apply Shannon’s entropy to a time series $X$, the probability space is defined by the distribution $p(x_i)$ for each value of $X$, where $x_i \in X$ and $i$ denotes the position of $x$ in the sequence. Entropy for a continuous-valued time series is defined as

$$H(X) = -\sum_{i} p(x_i) \log_2 p(x_i). \tag{1}$$

Additional information measures extend this concept. Conditional entropy (CE) [30] quantifies the amount of information required to describe a random variable $Y$ when the value of a related variable $X$ is known. The entropy of $Y$ conditioned on $X$ is written as $H(Y \mid X)$ and is defined as

$$H(Y \mid X) = -\sum_{i} \sum_{j} p(x_i, y_j) \log_2 p(y_j \mid x_i). \tag{2}$$

Mutual information (MI) [31] measures the degree of dependence between two random variables. For two time series $X$ and $Y$, MI is written as $I(X; Y)$ and is defined as

$$I(X; Y) = \sum_{i} \sum_{j} p(x_i, y_j) \log_2 \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}. \tag{3}$$

A special case arises when $X = Y$, in which case $I(X; X)$ equals $H(X)$, confirming that the information in a variable is fully dependent on itself. These entropy-based measures assist in the analysis of time series and FTS applications, particularly in feature selection tasks [32] where multiple measurements originate from the same series.
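The three measures can be illustrated with a minimal plug-in sketch that estimates probabilities from empirical frequencies. The equal-width discretization, the bin count, the price values, and all function names are illustrative assumptions and are not taken from the study.

```python
import math
from collections import Counter

def discretize(series, bins):
    """Map each value to an equal-width bin index (an illustrative choice)."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / bins
    if width == 0:
        return [0] * len(series)
    return [min(int((v - lo) / width), bins - 1) for v in series]

def entropy(symbols):
    """Shannon entropy H(X) in bits, estimated from empirical frequencies."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def conditional_entropy(y, x):
    """H(Y|X) computed through the identity H(X, Y) - H(X)."""
    return entropy(list(zip(x, y))) - entropy(x)

def mutual_information(x, y):
    """I(X;Y) computed through the identity H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Hypothetical open and close prices, used only to exercise the functions.
open_prices = [18.10, 18.12, 18.09, 18.15, 18.20, 18.18, 18.11, 18.25]
close_prices = [18.12, 18.09, 18.15, 18.20, 18.18, 18.11, 18.25, 18.22]
x = discretize(open_prices, bins=4)
y = discretize(close_prices, bins=4)
print(entropy(x), conditional_entropy(y, x), mutual_information(x, y))
```

Calling `mutual_information(x, x)` returns the same value as `entropy(x)`, which matches the special case described above.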
2.4. Financial Time Series Indicators
Financial indicators are statistics that assist in the assessment of the soundness, stability, and performance of a financial asset. These indicators also provide insight into economic activity across different sectors. The analysis and development of FTS indicators form part of the technical analysis of an asset. Two categories of indicators are relevant to this study, namely leading indicators [
33] and lagging indicators [
34]. A leading indicator is a variable that signals future change or movement in another FTS, process, trend, or event before the change occurs. A lagging indicator is a variable that correlates with an FTS and confirms the presence of a trend.
Table 2 lists several leading and lagging indicators together with their classification and definitions. The indicators of interest in this study are the average directional index (ADX), relative strength index (RSI), and simple moving average (SMA). These indicators provide direct information about trend strength, momentum, and directional movement, which aligns with the objective of classifying short-term behavior in FTS.
2.5. Ensemble Modeling
An ensemble model combines the predictions of multiple base models or learners. A level of diversification is achieved in an ensemble design, which allows the model to address patterns in complex data sets such as time series data [
22]. Three main approaches exist in ensemble model construction, namely bagging, boosting, and stacking. Each approach applies a distinct method that enables members of an ensemble to capture different patterns in separate regions of a data set. A bagging ensemble relies on a statistical procedure known as bootstrapping, which varies samples of the data set during the training phase [
35]. The trained members of an ensemble contribute to a final prediction through a voting mechanism. Bagging therefore represents a data-driven approach, and diversification arises from variation in the training data. In contrast, boosting represents a model-driven approach that aims to correct errors from prior models by altering the focus of the learning algorithm [
35]. Models are added sequentially, with each new model aiming to correct the errors of the previous one. The data set remains unchanged, but the algorithm assigns weights that direct attention toward correct or incorrect predictions [
10]. This weighting mechanism ensures a diverse set of learners that collectively outperform a single base model. Stacking is an ensemble method that combines multiple models to produce an output that exceeds the performance of any individual model [
36]. A stacked ensemble uses several models in an initial layer and then uses their outputs as inputs to a final model. The models in the initial layer operate independently, and the training data remains unchanged. This study focuses on the bagging approach. The evaluation includes six bootstrap methods for time series data, metaheuristic procedures for the optimization of voting mechanisms, and an assessment of the effect of ensemble size on classification performance.
2.6. Bootstrap Methods for Time Series
The bootstrap method introduced by Efron (1979) [37] is a statistical procedure that estimates properties of a population by resampling from an observed data set. A bootstrap sample arises by drawing observations one at a time from the data. In the classical procedure each selected value is returned to the sample space (sampling with replacement), although variants without replacement also exist. Time series data introduce additional complexity due to the dependence between observations. As a result, any resampling procedure must preserve the dependent structure of a series [38]. This study evaluates six bootstrap methods for time series data that support the development of ensemble models. These methods are the block, moving-block, sieve, stationary, and Tukey bootstrap methods, together with a sub-sampling method developed as a competing method.
The block bootstrap, illustrated in Figure 1a, resamples non-overlapping blocks of data rather than individual observations [39]. The time series is divided into blocks to preserve its dependent structure, and a bootstrap sample arises from a random selection of these blocks. The block size is a key parameter. A small block size may fail to preserve dependence, whereas a large block size may reduce variability in the resampled data [40, 41].
The moving-block bootstrap (MBB) [42], shown in Figure 1b, follows the same principle but uses overlapping blocks of fixed length. The block size determines the number of data points in each block, and a bootstrap sample arises from a random selection of these overlapping blocks. This approach aims to preserve the dependent structure of the original series.
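The difference between the block and moving-block procedures can be sketched as follows. This is a minimal illustration under stated assumptions: blocks are drawn with replacement, the final block is trimmed so the sample matches the original length, and the function names are hypothetical rather than the implementation used in the study.

```python
import random

def _concatenate(blocks, length, rng):
    """Draw blocks with replacement until the sample reaches the target length."""
    sample = []
    while len(sample) < length:
        sample.extend(rng.choice(blocks))
    return sample[:length]  # trim the final block to the original length

def block_bootstrap(series, block_size, rng):
    """Resample from non-overlapping blocks (a trailing partial block is dropped)."""
    blocks = [series[i:i + block_size]
              for i in range(0, len(series) - block_size + 1, block_size)]
    return _concatenate(blocks, len(series), rng)

def moving_block_bootstrap(series, block_size, rng):
    """Resample from overlapping blocks, one block per admissible start index."""
    blocks = [series[i:i + block_size]
              for i in range(len(series) - block_size + 1)]
    return _concatenate(blocks, len(series), rng)

rng = random.Random(0)
series = list(range(12))
print(block_bootstrap(series, 3, rng))
print(moving_block_bootstrap(series, 3, rng))
```

With a series of length 12 and a block size of 3, the block bootstrap chooses among 4 candidate blocks, whereas the moving-block bootstrap chooses among 10, which is the source of its additional variability.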
The sieve bootstrap introduced by Bühlmann (1997) [43] approximates a time series with an autoregressive model and resamples the residuals of that model. This method preserves the dependent structure of the series by resampling residuals from an autoregressive process of order $p(n)$. The order increases with the sample size under the conditions $p(n) \to \infty$ and $p(n) = o(n)$ as $n \to \infty$, where $p(n)$ defines the order of the autoregressive model and $o(n)$ denotes a function that bounds the growth rate of the order relative to the number of samples $n$ [43]. A bootstrap sample then arises from the autoregressive model in a manner similar to the MBB.
The stationary bootstrap [41] draws blocks of random length, where the starting index follows a uniform distribution and the block length follows a geometric distribution. This method is suitable for time series that require preservation of dependence but do not satisfy strict stationarity. This characteristic is helpful when modeling real-world data sets. However, the method is not recommended for data sets with strong trends.
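A minimal sketch of the stationary bootstrap follows. The geometric block length is generated implicitly: at each step a new block starts with probability $p = 1/\text{mean block length}$. The circular wrap-around at the end of the series and the parameter name `mean_block_length` are assumptions made for this illustration.

```python
import random

def stationary_bootstrap(series, mean_block_length, rng):
    """Blocks have a uniform random start and a geometrically distributed length."""
    n = len(series)
    p = 1.0 / mean_block_length  # probability of starting a new block
    sample = []
    index = rng.randrange(n)  # uniform starting index of the first block
    while len(sample) < n:
        sample.append(series[index])
        if rng.random() < p:
            index = rng.randrange(n)  # start a new block at a uniform position
        else:
            index = (index + 1) % n  # continue the current block, wrapping around
    return sample

print(stationary_bootstrap(list(range(10)), 3, random.Random(0)))
```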
A sub-sampling method developed by the authors serves as a competing approach. The method uses a block size equal to the data length $n$ divided by the number of required samples $m$. A random process selects overlapping blocks of fixed length, and each bootstrap sample contains $n/m$ observations.
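The description above leaves some details open, so the following is only one possible reading: each bootstrap sample is assumed to be a single contiguous block of length `n // m` whose start index is drawn uniformly, which allows blocks to overlap. The function name and this interpretation are assumptions, not the authors' implementation.

```python
import random

def sub_sampling(series, num_samples, rng):
    """Return num_samples contiguous blocks, each of length len(series) // num_samples."""
    n = len(series)
    block_size = n // num_samples
    # Uniform start indices; blocks drawn this way may overlap one another.
    starts = [rng.randrange(n - block_size + 1) for _ in range(num_samples)]
    return [series[s:s + block_size] for s in starts]

samples = sub_sampling(list(range(20)), num_samples=4, rng=random.Random(0))
print(samples)
```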
The Tukey bootstrap [44] is a variant of the moving-block bootstrap that applies tapering at the edges of each block to ensure continuity in the bootstrap sample path. The tapering acts as a window function that smooths the boundaries between adjacent blocks, which reduces discontinuities when blocks are concatenated. This method supports variance estimation, with a specific focus on the estimation of sample means.
2.7. Metaheuristics
Metaheuristics are high-level, problem-independent strategies that guide a search procedure toward a global best solution, without any guarantee of success for a specific optimization problem [
45]. A metaheuristic updates a candidate solution through a sequence of steps until a termination criterion indicates that no further improvement is possible. Numerous metaheuristic families exist, including nature-inspired procedures such as particle swarm optimization [
46] and metallurgy-inspired procedures such as simulated annealing [
47]. Rezk and Selim (2024) [
18] provide an overview of metaheuristic procedures used in ensemble model construction, with an emphasis on methods that adjust ensemble member contributions. According to Rezk and Selim (2024) [
18], swarm-based algorithms appear as the second most common metaheuristic class after evolutionary algorithms, and particle swarm optimization appears as the second most preferred algorithm after the genetic algorithm [
48]. Rezk and Selim (2024) [
18] also note that the selection of a metaheuristic depends on problem-specific criteria such as performance, diversity, complexity, and efficiency. Time series data sets arise from dynamic environments with varying levels of complexity; therefore, two metaheuristic procedures are used in this study, namely the particle swarm optimization and quantum-inspired particle swarm optimization. Both procedures are used to adjust ensemble member contributions within a voting mechanism.
The particle swarm optimization (PSO) algorithm is a nature-inspired procedure based on the social behavior of birds. PSO uses a stochastic search process to explore a population of candidate solutions and identify an optimal solution that satisfies predefined criteria [46]. Each candidate solution is a particle and the number of particles is user defined. Each particle $i$ is represented by four vectors: the position vector $\mathbf{x}_i$, the personal best vector $\mathbf{y}_i$, the velocity vector $\mathbf{v}_i$, and the global best vector $\hat{\mathbf{y}}$, which records the best position found by any particle in the swarm. The procedure begins with an initialization of positions and velocities. Velocities are set to zero, and the vectors $\mathbf{y}_i$ and $\hat{\mathbf{y}}$ are initialized to $\mathbf{x}_i$ and the best among the $\mathbf{y}_i$ values, respectively. The velocity update rule with inertia, introduced by Shi and Eberhart (1998) [49], is

$$\mathbf{v}_i(t+1) = \omega \mathbf{v}_i(t) + c_1 \mathbf{r}_1 \odot \left(\mathbf{y}_i(t) - \mathbf{x}_i(t)\right) + c_2 \mathbf{r}_2 \odot \left(\hat{\mathbf{y}}(t) - \mathbf{x}_i(t)\right), \tag{4}$$

where $\omega$ is the inertia weight, $\mathbf{r}_1$ and $\mathbf{r}_2$ are vectors of random values in $[0, 1]$, $\odot$ denotes the element-wise product, and $c_1$ and $c_2$ are acceleration coefficients that control cognitive and social influence. The position update rule is

$$\mathbf{x}_i(t+1) = \mathbf{x}_i(t) + \mathbf{v}_i(t+1). \tag{5}$$

After each update, the vectors $\mathbf{y}_i$ and $\hat{\mathbf{y}}$ are evaluated to determine whether new personal or global best positions have been reached. A termination condition signals the end of the procedure once no significant improvement is possible.
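The PSO procedure described above can be sketched in a few lines. This is a minimal illustration and not the self-adaptive variant used in the study: the parameter values ($\omega = 0.729$, $c_1 = c_2 = 1.494$), the search range, the fixed iteration budget used as the termination condition, and the sphere test function are all assumptions chosen for the example.

```python
import random

def pso(objective, dim, num_particles=20, iterations=200,
        w=0.729, c1=1.494, c2=1.494, seed=0):
    """Minimize objective with a global-best PSO that uses an inertia weight."""
    rng = random.Random(seed)
    x = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(num_particles)]
    v = [[0.0] * dim for _ in range(num_particles)]  # velocities start at zero
    y = [p[:] for p in x]  # personal bests start at the initial positions
    y_val = [objective(p) for p in y]
    best = min(range(num_particles), key=y_val.__getitem__)
    g, g_val = y[best][:], y_val[best]  # global best is the best personal best

    for _ in range(iterations):  # fixed budget as a simple termination rule
        for i in range(num_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + cognitive term + social term.
                v[i][j] = (w * v[i][j]
                           + c1 * r1 * (y[i][j] - x[i][j])
                           + c2 * r2 * (g[j] - x[i][j]))
                x[i][j] += v[i][j]  # position update
            value = objective(x[i])
            if value < y_val[i]:  # new personal best
                y[i], y_val[i] = x[i][:], value
                if value < g_val:  # new global best
                    g, g_val = x[i][:], value
    return g, g_val

def sphere(p):
    """Simple convex test function with its minimum at the origin."""
    return sum(c * c for c in p)

best_position, best_value = pso(sphere, dim=3)
print(best_value)
```

In this sketch the global best is refreshed as soon as a particle improves on it, which is one common variant; a synchronous version would refresh it only after all particles have moved.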
The quantum-inspired particle swarm optimization (QPSO) algorithm extends PSO by incorporating principles from quantum mechanics [50]. QPSO addresses environments where the best solution may shift over time. Two additional control parameters appear in QPSO: $s$, the number of quantum particles, and $r$, the radius of a quantum cloud around the global best position. Quantum particles are sampled from a probability distribution centered at $\hat{\mathbf{y}}$, and the update rule is

$$\mathbf{x}_i(t+1) \sim d\left(\hat{\mathbf{y}}(t), r\right), \tag{6}$$

where $d$ denotes a probability distribution and $r$ defines the quantum radius. Particles outside the quantum cloud follow the PSO update rules in Equations (4) and (5).
Both PSO and QPSO require suitable values for $\omega$, $c_1$, and $c_2$. Harrison et al. (2017) [51] propose a self-adaptive procedure based on the stability condition of Poli and Broomhead (2007) [52],

$$c_1 + c_2 < \frac{24\left(1 - \omega^2\right)}{7 - 5\omega}, \quad \omega \in [-1, 1]. \tag{7}$$

New values for $\omega$, $c_1$, and $c_2$ are selected after every $k$ iterations, with the value of $k$ recommended by Harrison et al. (2017) [51]. The additional QPSO parameters $s$, $d$, and $r$ remain user defined. Blackwell and Bentley (2002) [50] suggest a uniform distribution for $d$, although Harrison et al. (2015) [53] show that a uniform distribution may perform poorly in certain dynamic environments. A uniform distribution is used in this study to maintain a simple and consistent baseline. Harrison et al. (2015) [53] also note that smaller values of $r$ are preferable in environments with mild changes.
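A stability condition of the form $c_1 + c_2 < 24(1 - \omega^2)/(7 - 5\omega)$ lends itself to a simple check. The sketch below tests a parameter set against that inequality and draws a new set by rejection sampling. The sampling ranges and function names are assumptions for illustration only, and the rejection step is a simplification rather than the self-adaptive procedure of Harrison et al.

```python
import random

def satisfies_stability(w, c1, c2):
    """Check c1 + c2 < 24(1 - w^2) / (7 - 5w) for an inertia weight with |w| < 1."""
    return -1.0 < w < 1.0 and c1 + c2 < 24.0 * (1.0 - w * w) / (7.0 - 5.0 * w)

def sample_parameters(rng, max_tries=10_000):
    """Rejection-sample (w, c1, c2) until the stability condition holds."""
    for _ in range(max_tries):
        # Sampling ranges are illustrative assumptions.
        w, c1, c2 = rng.uniform(-1, 1), rng.uniform(0, 4), rng.uniform(0, 4)
        if satisfies_stability(w, c1, c2):
            return w, c1, c2
    raise RuntimeError("no stable parameter set found")

print(satisfies_stability(0.729, 1.494, 1.494))
print(sample_parameters(random.Random(0)))
```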
As part of the ensemble development process in this study, self-adaptive versions of PSO and QPSO adjust the contribution of each ensemble member in the voting mechanism. Each member receives a weight that determines its influence on the final prediction.
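A weighted voting mechanism of this kind can be sketched as follows. The class names follow the three trading actions used in the study, while the function name, the tie-breaking rule (the first class in the tuple wins), and the example weights are illustrative assumptions. In an optimization setting, the weight vector would correspond to the position of a particle.

```python
def weighted_vote(predictions, weights, classes=("buy", "sell", "do-nothing")):
    """Return the class with the largest total weight across ensemble members."""
    totals = {c: 0.0 for c in classes}
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # max() keeps the first class in the tuple when totals are tied.
    return max(classes, key=totals.get)

member_predictions = ["buy", "sell", "sell"]
print(weighted_vote(member_predictions, [0.6, 0.2, 0.1]))
print(weighted_vote(member_predictions, [1.0, 1.0, 1.0]))
```

The second call uses equal weights and therefore reduces to plain majority voting, which shows how optimized weights can overturn a majority when one member is trusted more than the others.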
2.8. Comparison of Financial Time Series Techniques
Financial time series modeling spans a broad methodological landscape across econometric, machine learning (ML), deep learning (DL), and ensemble learning domains. Each domain introduces distinct assumptions, data requirements, and performance characteristics, which results in a diverse and often fragmented body of research. To establish a clear context for the methodological choices adopted in this study,
Table 3 presents a comparative summary of representative contributions across these domains. The table outlines the modeling techniques used, the type of data analyzed, the primary task addressed, and the main outcomes reported. This structured overview provides a foundation for understanding the range of approaches applied to forecasting and classification tasks in FTS settings and clarifies the methodological gaps that motivate the unified evaluation conducted in the present study.
The studies in
Table 3 reveal several consistent patterns across the FTS literature. Econometric models such as ARIMA and GARCH provide essential baselines for linear dynamics and volatility structure, although these models offer limited capacity for nonlinear behavior. ML and DL methods introduce greater flexibility, with evidence that classifier performance varies across data sets and that architectures such as LSTMs can exceed the performance of traditional baselines. Ensemble methods appear across all domains, which reflects a broad consensus that a combination of learners improves robustness and predictive accuracy. However, existing ensemble learning studies typically address isolated components such as pruning, weighting, or bootstrap diversity, rather than evaluating a complete ensemble design pipeline. As a result, prior research provides valuable but incomplete insights. The present study addresses this gap by conducting an end-to-end comparison of ML, DL, and econometric baselines within a unified ensemble modeling framework for FTS classification.
3. Empirical Process
The empirical process defines the sequence of steps required to transform raw financial time series into structured, statistically assessed, and consistently labeled data sets for subsequent model development. This process includes data acquisition, data quality assessment, data corrections, segmentation into intervals, stationarity analysis, entropy analysis, and class label construction. These steps establish a reproducible foundation for the evaluation of ensemble models. This section presents the workflow structure, the data sets used, the preprocessing procedures applied, and the metrics selected for performance evaluation.
3.1. Empirical Workflow Overview
Figure 2 presents a structured workflow that defines the complete empirical process used in this study. The workflow begins with raw FTS data and proceeds through a sequence of data preparation and analysis steps. These steps include a data quality assessment, the application of data corrections, the segmentation of each data set into intervals, and statistical assessments based on stationarity and entropy. A labeling procedure follows, together with an evaluation of the resulting class distributions.
The second stage of the workflow focuses on ensemble model development. Bootstrap sample generation provides the diversity required for bagging ensemble construction. Each ensemble is formed from identical base learners trained on distinct bootstrap samples. A metaheuristic optimization procedure then assigns a weight to each ensemble member to determine its contribution to the final prediction. Baseline models are trained on the same input data to provide reference performance.
The final stage of the workflow evaluates the predictive performance of the ensemble models. Performance metrics quantify accuracy and profit, and a comparative analysis contrasts ensemble performance with the performance of baseline models. This workflow establishes a consistent and reproducible process for all data sets and ensemble configurations used in this study.
3.2. Datasets
Six financial time series data sets form the basis of the empirical analysis. Each data set was obtained from the HistData repository, which supplies one-minute resolution price data for multiple asset classes. HistData.com provides historical market data for research and educational use; the platform specifies that the data are offered without warranty and may contain gaps or irregularities due to market conditions or data collection constraints. The selection includes two commodity series, two stock index series, and two exchange rate series to ensure diversity across market types and volatility regimes.
The data sets consist of Brent Crude Oil in United States Dollars (BCOUSD), Gold in United States Dollars (XAUUSD), the EURO STOXX 50 index in EUROs (ETXEUR), the Nikkei 225 index in Japanese Yen (JPXJPY), the United States Dollar to South African Rand exchange rate (USDZAR), and the Japanese Yen to South African Rand exchange rate (JPYZAR). Each data set contains five features, namely open, close, high, and low prices as well as the reported trading volume, all recorded at one-minute intervals.
Table 4 summarizes the date ranges, total observations, number of one-day samples, and asset class categories for each series.
3.3. Data Preprocessing
The data preprocessing stage establishes the structural and statistical integrity required for all subsequent empirical analysis. This stage includes a data quality assessment and the application of corrective procedures, followed by stationarity and entropy evaluations, and the construction of class labels. Each component contributes to a consistent and reproducible preparation of FTS data sets.
3.3.1. Data Quality Assessments and Corrections
The data quality assessment identified three structural issues across all data sets. First, missing values appeared at irregular one-minute intervals for multiple features. Second, price observations were recorded during weekends despite the absence of active trading. Third, the volume feature contained no usable information, with extended sequences of zero values across all series.
Corrective procedures were applied to address these issues. Missing values were corrected through linear interpolation between the nearest valid observations to preserve the one-minute sampling structure [56]. Weekend observations were removed to align each series with standard market trading days [57]. The volume feature was excluded from further analysis due to the absence of reliable information. Each data set was then segmented into one-day intervals covering the period 09:00 to 17:00 to reduce computational requirements and to focus the analysis on periods with higher price variability [57].
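The corrective steps can be sketched with the standard library alone. The sketch assumes that missing prices are represented as `None`, that rows are `(timestamp, price)` pairs, and that the 09:00 to 17:00 window is inclusive at both ends; these representations and the function names are illustrative assumptions.

```python
from datetime import datetime, time

def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest valid values."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for k in range(left + 1, right):
            filled[k] = filled[left] + step * (k - left)
    return filled  # gaps before the first or after the last valid value remain None

def keep_trading_session(rows, start=time(9, 0), end=time(17, 0)):
    """Drop weekend rows and rows outside the daily session window."""
    return [(ts, price) for ts, price in rows
            if ts.weekday() < 5 and start <= ts.time() <= end]

rows = [
    (datetime(2024, 1, 5, 10, 0), 18.70),  # Friday, inside the session
    (datetime(2024, 1, 6, 10, 0), 18.71),  # Saturday, removed
    (datetime(2024, 1, 5, 18, 0), 18.72),  # Friday, outside the session
]
print(interpolate_missing([1.0, None, None, 4.0]))
print(keep_trading_session(rows))
```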
3.3.2. Stationarity
The ADF [27] and KPSS [28] tests were applied to each data set at both the one-day level and the full-series level. Table 5 summarizes the outcomes. The results indicate weak stationarity across all series, with evidence of a unit root. This outcome suggests the presence of an underlying stochastic process that drives price evolution in each FTS. Such behavior is consistent with the influence of market forces, economic conditions, and geopolitical factors on asset prices [58].
3.3.3. Entropy
Shannon’s (1948) entropy measure was applied to each FTS, together with conditional entropy and mutual information as defined in later information-theoretic work [
59].
Table 6 summarizes the results in units of information. The mutual information values indicate high levels of shared information across the open, close, high, and low price features. Conversely, the conditional entropy values indicate low levels of new information contributed by each feature. These outcomes are expected because all features represent different views of the same underlying price process. High mutual information reflects the shared structure of FTS, while low conditional entropy reflects the limited amount of unique information available from each feature.
3.3.4. Labeling
The labeling procedure defines a three-class classification structure for each FTS. The three classes correspond to buy, sell, and do-nothing actions, and each class reflects a directional decision for the next trading day. Framing the problem as a three-class classification task aligns with practical trading behavior, where the objective is not to forecast precise price levels, but to determine whether a position should be taken or avoided. This structure mirrors real-world execution choices and provides a stable alternative to point forecasting in noisy intra-day environments [
57,
60,
61].
The construction of these classes relies on the distribution of daily price differences and on a set of financial indicators that provide directional signals.
Figure 3 illustrates the distribution of daily differences between the open and close prices for the USDZAR exchange rate. Similar distributions were observed across all included data sets. Each distribution exhibits a near-normal shape with a high concentration of values around zero. This behavior aligns with the outcomes of the stationarity and entropy analyses. The weak stationarity observed in
Section 3.3.2 implies that each series fluctuates around a slowly evolving mean, which contributes to the clustering of daily differences near zero. The entropy results further indicate high mutual information and low conditional entropy across features, which suggests that the open and close price series share substantial information and provide limited new information individually. These properties collectively explain the concentration of small daily movements and the approximate normality of the daily difference distributions.
The concentration of values near zero motivates the introduction of an offset region around the center of each distribution. This region defines a do-nothing class and excludes observations where the daily difference is too small to provide a reliable directional signal. The exclusion is justified by two considerations. First, the statistical properties of the series indicate that small positive and negative movements arise from similar underlying patterns, which reduces the ability of any model to distinguish between them. Second, common trading practice avoids directional decisions when price movements fall within a narrow range, since such movements do not justify a meaningful position. The offset therefore removes ambiguous cases and produces a clearer separation between buy and sell classes.
Directional labels outside the offset region are determined using three financial indicators, namely the average directional index (ADX) [
62], the relative strength index (RSI) [
63], and a moving average (MA) [
64]. These indicators are widely used in practice and capture complementary aspects of market behavior, including trend strength, momentum, and mean reversion. Using three indicators provides a balance between signal diversity and interpretability, avoiding the instability that arises when relying on a single indicator or an overly complex indicator set. A buy or sell label is assigned when at least two indicators agree on the direction. This majority rule reduces the subjectivity associated with any single indicator and ensures that each label reflects a consistent and interpretable directional signal. A do-nothing label is assigned when the indicators do not agree or when the daily difference falls within the offset region.
Figure 4 illustrates the ADX and RSI rules applied to determine regions for the buy, sell, and do-nothing directional labels.
Table 7 summarizes the offset values and resulting class distributions for each data set. The distributions exhibit varying degrees of class imbalance, which reflect the natural behavior of real-world financial markets rather than any artifact of the labeling procedure. No resampling techniques were applied, as oversampling or weighting would introduce synthetic patterns and distort the empirical structure of directional movements. Offset sensitivity analysis confirmed that class proportions remained stable across reasonable offset choices, supporting the robustness of the labeling strategy. The resulting labels provide a consistent and interpretable structure for the classification task presented in this study.
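The labeling logic, which combines an offset region with a two-out-of-three majority rule, can be sketched as follows. The sketch takes the directional signal of each indicator as an input (+1 for up, -1 for down, 0 for no signal) because the thresholds that turn ADX, RSI, and MA values into signals are not reproduced here. The function name, the signal encoding, and the example values are assumptions.

```python
def label_day(daily_difference, offset, signals):
    """Assign buy, sell, or do-nothing from an offset region and a majority rule.

    signals holds three values in {+1, -1, 0}, one per indicator (ADX, RSI, MA).
    """
    if abs(daily_difference) <= offset:
        return "do-nothing"  # movement too small to give a reliable direction
    if sum(1 for s in signals if s == 1) >= 2:
        return "buy"  # at least two indicators point upward
    if sum(1 for s in signals if s == -1) >= 2:
        return "sell"  # at least two indicators point downward
    return "do-nothing"  # the indicators do not agree

print(label_day(0.12, offset=0.05, signals=(1, 1, -1)))
print(label_day(0.01, offset=0.05, signals=(1, 1, 1)))
print(label_day(-0.20, offset=0.05, signals=(1, -1, 0)))
```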
3.4. Evaluation Metrics
Two evaluation metrics are used to assess the performance of the ensemble models: a classification accuracy metric and a profit metric. They provide complementary perspectives on model performance. The accuracy metric evaluates the correctness of the predicted class labels, while the profit metric evaluates the financial value of the predicted trading actions. To determine whether the ensemble model provides a statistically significant improvement over a baseline model, the accuracy and profit outcomes obtained across k cross-validation folds are compared using a set of paired statistical tests. These tests evaluate both the magnitude and the direction of the performance differences between the two models.
3.4.1. Classification Accuracy
The classification accuracy measures the proportion of correctly predicted labels relative to the total number of predictions. Let $y_i$ denote the true class label for day $i$, and let $\hat{y}_i$ denote the predicted class label. The accuracy is defined as
$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i),$$
where $N$ is the total number of one-day samples and $\mathbb{1}(\cdot)$ is the indicator function. This metric provides a direct measure of the model’s ability to identify the correct trading action.
3.4.2. Profit Metric
The profit metric evaluates the realized financial outcome of the predicted trading actions. For each day $i$, let $\Delta_i$ denote the realized price difference between the open and close prices of the next trading day. A positive value of $\Delta_i$ indicates a profitable buy action, while a negative value indicates a profitable sell action. Let $a_i$ denote the predicted action, where $a_i \in \{\text{buy}, \text{sell}, \text{do-nothing}\}$. The profit for day $i$ is defined as
$$\pi_i = \begin{cases} \Delta_i, & a_i = \text{buy}, \\ -\Delta_i, & a_i = \text{sell}, \\ 0, & a_i = \text{do-nothing}. \end{cases}$$
The total profit across all days is then
$$\Pi = \sum_{i=1}^{N} \pi_i.$$
This metric captures the cumulative financial value of the model’s predictions and provides an application-oriented evaluation aligned with trading practice.
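A minimal sketch of the profit computation is given below. It assumes that the price difference is measured as close minus open (so that a positive value rewards a buy), that every trade has unit size, and that transaction costs are ignored; the function names and string labels are illustrative.

```python
def daily_profit(action, delta):
    """Profit of one predicted action given the realized price difference.

    delta is assumed to be (close - open) of the next trading day.
    """
    if action == "buy":
        return delta       # a rising price rewards a buy
    if action == "sell":
        return -delta      # a falling price rewards a sell
    return 0.0             # do-nothing neither gains nor loses


def total_profit(actions, deltas):
    """Cumulative profit over all days (unit position, no costs assumed)."""
    return sum(daily_profit(a, d) for a, d in zip(actions, deltas))
```

For example, the actions buy, sell, do-nothing, buy against price differences of 1.5, -2.0, 3.0, and -0.5 give a total of 1.5 + 2.0 + 0 - 0.5 = 3.0, even though only two of the three trades were profitable.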
3.4.3. Statistical Comparison of Model Performance
Several statistical tests are applied to evaluate whether the ensemble model significantly outperforms the baseline model across paired observations obtained from a k-fold cross-validation procedure. Let $d_i$ denote the paired difference between the ensemble and baseline performance metrics for fold $i$, where $i = 1, \dots, k$.
Wilcoxon Signed-Rank Test
The Wilcoxon signed-rank test [65] evaluates whether the median of the paired differences obtained across the k cross-validation folds differs from zero. The test ranks the absolute paired differences $|d_i|$ and assigns signs based on the direction of each difference. The test statistic is the sum of the signed ranks. This non-parametric test is appropriate for model comparison because it does not assume normality of the fold-level paired differences.
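For a small number of folds, the test can be computed exactly by enumerating every possible sign assignment. The sketch below is one way to do this in plain Python; it assumes a two-sided alternative, drops zero differences, and reports the sum of positive ranks (which carries the same information as the sum of signed ranks). These choices, and the function names, are assumptions rather than the configuration used in the study.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def wilcoxon_exact(diffs):
    """Two-sided exact Wilcoxon signed-rank test on paired differences."""
    d = [x for x in diffs if x != 0]            # zero differences are dropped
    ranks = average_ranks([abs(x) for x in d])
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    centre = sum(ranks) / 2                     # expected W+ under the null
    observed = abs(w_plus - centre)
    # Enumerate all 2**k sign assignments; only feasible for small k.
    extreme = 0
    for signs in product((0, 1), repeat=len(d)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(d)
```

With k = 10 folds and no zero or tied differences, the smallest two-sided p-value this procedure can return is 2/1024 ≈ 0.00195, which bounds how much evidence ten folds can provide.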
Sign Test
The sign test [
66] evaluates whether the number of positive paired differences obtained across the
k cross-validation folds exceeds the number expected under the null hypothesis of no performance difference. The test statistic is the count of positive signs in the paired differences. This test provides a distribution-free measure of directional dominance and does not require any assumptions regarding the distribution of the fold-level performance differences.
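The sign test reduces to a binomial tail probability, as the sketch below shows. It assumes a one-sided alternative (the ensemble wins more often than chance) and discards zero differences; both are assumptions, and the function name is illustrative.

```python
from math import comb


def sign_test(diffs):
    """One-sided exact sign test on paired differences.

    Returns the number of positive differences and the probability of
    observing at least that many under a fair-coin null hypothesis.
    """
    n_pos = sum(1 for x in diffs if x > 0)
    n = sum(1 for x in diffs if x != 0)  # ties (zeros) are discarded
    p_value = sum(comb(n, j) for j in range(n_pos, n + 1)) / 2 ** n
    return n_pos, p_value
```

For instance, eight positive differences out of ten give a p-value of (45 + 10 + 1)/1024 ≈ 0.0547.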
Cliff’s Delta
Cliff’s delta ($\delta$) [67] measures the effect size by quantifying the degree to which the ensemble model outperforms the baseline model across the
k cross-validation folds. It is defined as
where
denotes the paired performance difference for fold
i. Values close to 1 indicate strong dominance of the ensemble model, while values near 0 indicate negligible differences between the models.
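As an illustration, the sketch below computes Cliff’s delta in its standard all-pairs form, which compares every ensemble fold score with every baseline fold score. Whether the study uses exactly this form or a variant based only on the fold-level paired differences is an assumption here, and the function and argument names are illustrative.

```python
def cliffs_delta(ensemble_scores, baseline_scores):
    """Standard all-pairs Cliff's delta.

    Returns (#pairs with x > y - #pairs with x < y) / (m * n),
    a value in [-1, 1]; positive values favour the ensemble.
    """
    greater = sum(1 for x in ensemble_scores for y in baseline_scores if x > y)
    smaller = sum(1 for x in ensemble_scores for y in baseline_scores if x < y)
    return (greater - smaller) / (len(ensemble_scores) * len(baseline_scores))
```

A value of 1 means every ensemble score exceeds every baseline score, and a value of 0 means wins and losses balance out.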
Bootstrap Confidence Intervals
A non-parametric bootstrap procedure [
37] is applied to estimate the confidence interval of the mean paired difference across the
k cross-validation folds. The bootstrap resamples the
k paired differences with replacement and computes the empirical distribution of the mean paired difference. The lower and upper bounds of the interval (
CI Low,
CI High) quantify the uncertainty associated with the estimated performance difference between the ensemble and baseline models.
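The bootstrap interval can be sketched with a percentile method, as below. The number of resamples, the 95% level, the fixed seed, and the use of the percentile method itself are assumptions made for illustration; the study does not state which interval construction is used.

```python
import random


def bootstrap_mean_ci(diffs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference.

    n_resamples, alpha, and seed are hypothetical defaults.
    """
    rng = random.Random(seed)
    k = len(diffs)
    # Resample the k differences with replacement and record each mean.
    means = sorted(
        sum(rng.choices(diffs, k=k)) / k for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high
```

An interval that lies entirely above zero indicates that the ensemble’s mean advantage is unlikely to be an artifact of fold-to-fold variation.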
Bayesian Posterior Probabilities
A Bayesian comparison [
68] evaluates the posterior probability that the ensemble model outperforms the baseline model across the
k cross-validation folds. Let $\theta$ denote the probability that the paired difference $d_i$ is positive. A Beta prior is placed on $\theta$, and the posterior distribution is obtained by updating the prior with the observed counts of positive and negative paired differences. The posterior probabilities quantify the likelihood that a specific model was superior, given the observed fold-level performance differences.
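The Beta update has a closed form, sketched below. The uniform Beta(1, 1) prior and the use of the posterior mean as the reported summary are assumptions, since the prior parameters are not stated; the function name is illustrative.

```python
def beta_posterior_mean(diffs, prior_a=1.0, prior_b=1.0):
    """Posterior mean of the win probability under a Beta prior.

    prior_a and prior_b default to a uniform Beta(1, 1) prior (assumed).
    Zero differences count as neither a win nor a loss.
    """
    wins = sum(1 for x in diffs if x > 0)
    losses = sum(1 for x in diffs if x < 0)
    a, b = prior_a + wins, prior_b + losses  # conjugate Beta update
    return a / (a + b)
```

Under these assumptions, ten folds that all favor one model yield a posterior mean of 11/12 ≈ 91.7%, and six favorable folds out of ten yield 7/12 ≈ 58.3%. The prior therefore caps the attainable value below 100% for any finite number of folds.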
4. Model Development
The ensemble model design for this study is structured around four core components: the base models, the bootstrap methods used to generate resampled training sets, the voting mechanisms that aggregate member predictions, and the ensemble sizes evaluated. Each component is examined systematically to assess its contribution to overall ensemble performance.
4.1. Base Models
Five predictive models are developed for this study. Three of these models serve as ensemble base learners, namely a decision tree (DT) [69], a logistic regression (LR) [70], and a multi-layer perceptron (MLP) [71]. Two additional standalone baseline models are included for comparative evaluation, namely an autoregressive integrated moving average (ARIMA) model [2] and a long short-term memory (LSTM) network [72]. The combined set of models provides a diverse collection of linear, non-linear, parametric, and non-parametric approaches. This ensures that the ensemble and baseline comparisons span a broad methodological spectrum.
The DT, LR, and MLP models are selected as ensemble base learners due to their simplicity, computational efficiency, and complementary modeling characteristics. Each model forms the basis of an independent bagging ensemble, allowing the study to evaluate how bootstrap-based resampling interacts with different classes of predictive models. The DT provides a non-parametric, rule-based classifier capable of capturing local decision boundaries. The LR provides a linear probabilistic classifier that offers interpretability and robustness in high-dimensional settings. The MLP provides a non-linear neural network classifier capable of modeling complex feature interactions. This variety enables a systematic assessment of how model class influences the behavior of bootstrap ensembles in FTS classification [
73]. Hyperparameter optimization is intentionally omitted for the base learners, as the objective of this study is to isolate and evaluate the effects of bootstrap resampling and voting mechanisms rather than to optimize individual model performance.
The key control parameters for all five models are summarized in
Table 8. The three ensemble base learners use the default parameters of the
scikit-learn library [
74], except for the MLP, where the hidden layer size is set to 160 units (33.3% of the input dimensionality). The reduced hidden layer size decreases the number of trainable weights, thereby reducing computational cost while maintaining sufficient representational capacity [
75].
The ARIMA model provides a classical linear time series baseline capable of modeling short-term autocorrelation structures. The model is fitted to the daily closing prices extracted from each FTS. The daily series is differenced once to remove non-stationarity. The Akaike Information Criterion (AIC) is used to guide model selection, where AIC provides a likelihood-based measure of model quality that penalizes excessive model complexity, thereby balancing goodness of fit with model simplicity [
79]. An AIC-based grid search is performed over candidate orders to select the ARIMA order $(p, d, q)$. Multi-step forecasts are generated for the test horizon, and the forecasted price changes are converted into directional classes using a volatility-scaled offset. The offset is defined in terms of $\sigma$, where $\sigma$ denotes the standard deviation of the forecasted price changes. This offset introduces a neutral zone around zero, ensuring that small or uncertain forecasted movements are classified as no-action decisions, which aligns with the labeling strategy. The ARIMA model therefore provides a benchmark for linear autoregressive behavior in financial time series [
80]. The performance of an ARIMA model depends directly on its order parameters. Therefore, control parameter tuning is required to obtain a well-specified linear benchmark, whereas the other baseline models operate in standard configurations that do not require structural hyperparameter selection.
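The conversion of ARIMA forecasts into directional classes can be sketched as follows. The multiplier `scale` is a hypothetical placeholder for the volatility scaling, whose value is not assumed here to match the study, and the use of the population standard deviation is likewise an assumption.

```python
from statistics import pstdev


def forecasts_to_classes(forecast_changes, scale=0.5):
    """Map forecasted price changes to buy / sell / do-nothing.

    scale is a hypothetical multiplier applied to the standard deviation
    of the forecasted changes to form the neutral zone around zero.
    """
    offset = scale * pstdev(forecast_changes)
    classes = []
    for change in forecast_changes:
        if change > offset:
            classes.append("buy")
        elif change < -offset:
            classes.append("sell")
        else:
            classes.append("do-nothing")  # inside the neutral zone
    return classes
```

Because the offset is derived from the forecasts themselves, a flat forecast path produces a narrow neutral zone and a volatile one produces a wide zone.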
The LSTM model provides a non-linear deep learning baseline capable of capturing short-term temporal dependencies. The network consists of a single LSTM layer with 32 hidden units followed by a dense output layer with three units and a softmax activation function. The hidden size of 32 units provides a lightweight architecture with sufficient capacity to model intra-day temporal structure while limiting the number of trainable parameters and reducing the risk of overfitting. The model is trained using the
adam optimizer with a learning rate of 0.001 and the sparse categorical cross-entropy loss function. The input to the model is a fixed-length sequence of intra-day observations, and the output corresponds to the three directional classes used throughout the study. This model provides a benchmark for recurrent neural architectures commonly applied to FTS classification [
55].
4.2. Bootstrap Methods
Six bootstrap methods for time series data are used in the development of the ensemble models. Five of these are the block, moving-block, sieve-block, Tukey, and stationary bootstrap procedures discussed in
Section 2.6, and are implemented using the
tsbootstrap package in Python 3.10.12. In addition to these established approaches, a sub-sampling method is developed and evaluated alongside the existing techniques. The block length used for generating bootstrap samples is set to 480, corresponding to the number of intra-day observations contained within a single day segment of each FTS.
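To illustrate how a block length governs resampling, the sketch below implements a plain moving-block bootstrap in standard Python. It is not the tsbootstrap API and is not claimed to reproduce that package’s behavior; the function name and the truncation of the final block are assumptions.

```python
import random


def moving_block_bootstrap(series, block_length, seed=None):
    """Draw one bootstrap sample by concatenating overlapping blocks.

    Blocks of block_length consecutive observations are drawn with
    replacement from all possible start positions until the sample is as
    long as the original series; any excess is truncated.
    """
    n = len(series)
    if not 0 < block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    rng = random.Random(seed)
    last_start = n - block_length
    sample = []
    while len(sample) < n:
        start = rng.randint(0, last_start)  # inclusive on both ends
        sample.extend(series[start:start + block_length])
    return sample[:n]
```

With a block length of 480, each drawn block would span the same number of observations as one day segment, so within-block temporal ordering is preserved while the order of blocks is randomized.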
4.3. Voting Mechanisms
Two voting mechanisms are compared, namely an equal-weighted majority voting scheme, in which each ensemble member contributes an identical vote, and an optimized weighted voting scheme, in which ensemble members are assigned weights that determine their relative influence [
9]. The optimized voting mechanism is implemented using two metaheuristic algorithms, namely PSO and QPSO. The selection of PSO is informed by the comprehensive review of Rezk and Selim (2024) [
18], while QPSO is included due to its ability to capture the dynamic and non-stationary characteristics inherent in FTS data.
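The difference between the two voting schemes can be sketched as follows. Equal-weighted majority voting is the special case in which all weights are identical. The function name, the string labels, and the tie-breaking rule (first class in the listed order) are assumptions; in the study the weights would be supplied by PSO or QPSO.

```python
def weighted_vote(member_predictions, weights=None,
                  classes=("buy", "sell", "do-nothing")):
    """Aggregate one prediction per ensemble member into a single label.

    weights=None gives every member an identical vote (majority voting).
    Ties are broken in favour of the class listed first (an assumption).
    """
    if weights is None:
        weights = [1.0] * len(member_predictions)
    scores = {c: 0.0 for c in classes}
    for prediction, weight in zip(member_predictions, weights):
        scores[prediction] += weight
    return max(classes, key=lambda c: scores[c])
```

For example, the member predictions buy, sell, sell produce sell under equal weights, but produce buy when the first member carries a weight of 0.7 and the others 0.2 and 0.1.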
The control parameters for both PSO and QPSO are adapted using a self-adaptive strategy based on the stability-guided approaches proposed by Harrison et al. (2015, 2017) [
51,
53], as discussed in
Section 2.7. The remaining algorithm-specific parameters are listed in
Table 9.
In addition, two loss functions are used independently as objective functions within the PSO and QPSO optimization processes to evaluate ensemble performance under different criteria. The entropy-based loss function aligns with directional accuracy by penalizing incorrect predictions uniformly, whereas the profit-based loss function evaluates the realized profit or loss associated with each prediction, thereby capturing model profitability. The motivation for employing two distinct loss functions follows from
Figure 3, which illustrates the non-linear relationship between accuracy and profitability.
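The two objective functions can be sketched as below for a candidate weight vector. The cross-entropy form of the entropy-based loss (computed on normalized weighted vote shares) and the sign convention of the profit-based loss (negative total profit, so that lower is better) are assumptions, as are all names; positive weights and a close-minus-open price difference are also assumed.

```python
from math import log

CLASSES = ("buy", "sell", "do-nothing")


def vote_shares(member_preds, weights):
    """Normalized weighted vote share of each class for one day."""
    total = sum(weights)
    shares = {c: 0.0 for c in CLASSES}
    for prediction, weight in zip(member_preds, weights):
        shares[prediction] += weight / total
    return shares


def entropy_loss(all_member_preds, weights, true_labels, eps=1e-12):
    """Mean cross-entropy of the vote shares against the true labels."""
    losses = [
        -log(vote_shares(preds, weights)[label] + eps)
        for preds, label in zip(all_member_preds, true_labels)
    ]
    return sum(losses) / len(losses)


def profit_loss(all_member_preds, weights, deltas):
    """Negative total profit of the weighted-vote decisions."""
    profit = 0.0
    for preds, delta in zip(all_member_preds, deltas):
        shares = vote_shares(preds, weights)
        action = max(CLASSES, key=lambda c: shares[c])
        if action == "buy":
            profit += delta
        elif action == "sell":
            profit -= delta
    return -profit
```

The entropy loss treats every misclassified day alike, whereas the profit loss weights each day by the size of its price movement, which is why the two objectives can favor different weight vectors.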
4.4. Ensemble Sizes
Five variations of each ensemble model are developed, consisting of ensembles with 10, 30, 50, 100, and 150 members, respectively. These sizes allow an examination of how ensemble performance evolves as the number of constituent members increases, ranging from small ensembles to more computationally intensive configurations.
4.5. Training Procedure
Three ensemble models are developed using DT, LR, and MLP base learners combined with six bootstrap methods. Each base model–bootstrap combination is evaluated across five ensemble sizes of 10, 30, 50, 100, and 150 members. Model training and evaluation are performed using the time series cross-validation procedure available in the
scikit-learn package [
74], with ten cross-validation folds applied for each voting mechanism. The choice of ten folds provides a balance between computational efficiency and robust performance estimation across the temporal structure of each FTS.
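The fold structure of a time series cross-validation can be sketched as an expanding window, as below. The function is written in plain Python and is intended to mirror the default behavior of scikit-learn’s `TimeSeriesSplit` (no gap, no cap on training size); that the study uses this class with default settings is an assumption.

```python
def time_series_splits(n_samples, n_splits=10):
    """Yield (train_indices, test_indices) for expanding-window folds.

    Each test fold follows its training data in time, and every later
    fold trains on all observations that precede its test segment.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of folds")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        yield range(0, start), range(start, start + test_size)
```

Because training indices always precede test indices, no future observation leaks into model fitting, which is the property that distinguishes this scheme from shuffled k-fold cross-validation.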
Three voting mechanisms are examined, namely equal-weighted majority voting, and PSO- and QPSO-based optimized voting. After each permutation run, the accuracy and profit metrics are recorded for subsequent analysis. Results are summarized using a wins-based approach, where wins correspond to the highest mean performance across data sets for a given metric. Standard deviations are used to confirm the robustness of each winner, with lower variability indicating more stable performance. Wins therefore identify the best average accuracy and best average profitability achieved across the experimental configurations.
The baseline models are trained using the same time series cross-validation procedure applied to the ensembles, ensuring that all models are evaluated under an identical temporal structure. Each baseline model is fitted on the training portion of each fold and assessed on the corresponding test segment, preserving chronological order throughout. Standard configurations are used for all baseline methods, with ARIMA being the only model requiring structural tuning due to its dependence on order parameters. This alignment with the ensemble training process provides a consistent and comparable evaluation framework across all models.
5. Ensemble Results
This section presents and discusses the empirical results of the ensemble framework. The overall performance of the ensemble models is evaluated across base learners, bootstrap methods, voting mechanisms, ensemble sizes, and objective functions. Detailed analyses are then provided to examine how each design component influences predictive accuracy and profitability. This offers a comprehensive view of the factors that shape ensemble performance for the FTS included in this study.
5.1. Overall Performance
The LR ensemble model provides the strongest overall performance across all data sets, bootstrap methods, voting mechanisms, and ensemble sizes, as shown in
Table 10 and
Table 11. The LR ensemble consistently outperforms the DT and MLP ensembles on both the accuracy and profit metrics, with the exception of the USDZAR data set, where the MLP ensemble achieves a marginally higher profitability. These results indicate that the LR ensemble is able to capture the dominant directional structure of an FTS more reliably than the DT and MLP ensembles.
The superior performance of the LR ensemble can be explained by the bias–variance characteristics of the underlying learners. LR is a high-bias, low-variance model [
73], and when combined with bootstrap aggregation, its stable decision boundary becomes more robust to the noise and microstructure irregularities present in intra-day financial data [
81]. In contrast, the MLP is a low-bias, high-variance model whose performance is sensitive to hyperparameter tuning [
75]. Without extensive tuning, the MLP tends to overfit short-lived fluctuations that do not generalize across bootstrap samples or cross-validation folds. The DT ensemble exhibits similar behavior. Although bagging reduces variance, the underlying tree structure remains sensitive to small perturbations in the data [
69], which limits its ability to generalize in weakly stationary environments.
These observations align with the statistical properties of the data sets analyzed in
Section 3.3.2 and
Section 3.3.3. The weak stationarity and high mutual information across features imply that the directional signal is relatively smooth and dominated by broad, persistent tendencies rather than complex nonlinear interactions [
82]. In such settings, linear decision boundaries often outperform more flexible nonlinear models, particularly when the latter are not extensively tuned [
55]. The LR ensemble therefore benefits from a favorable alignment between model structure and the underlying data-generating process.
The Tukey bootstrap method provides the strongest overall performance across the data sets, as shown in
Table 12. The tapering applied at block boundaries produces smoother transitions between adjacent segments, which better preserves the local autocorrelation and volatility clustering inherent in FTS [
41]. Methods with hard block boundaries introduce artificial discontinuities that distort short-term temporal structure. The stationary bootstrap performs well on some data sets, reflecting the suitability of its block-based resampling for certain local patterns, but the Tukey method provides the most consistent performance across both accuracy and profitability.
Table 12 also summarizes the performance of the voting mechanisms. The optimized weighted voting mechanisms (PSO and QPSO) outperform equal-weighted majority voting on the profitability metric across all data sets. This behavior reflects the ability of PSO and QPSO to identify weight configurations that emphasize ensemble members capturing rare but profitable directional movements [
18,
83]. The QPSO algorithm benefits from a more global search capability, which reduces the likelihood of converging to local optima in the profit landscape [
83].
The preferred ensemble size across the FTS data sets is 50, as shown in
Table 13. This result aligns with ensemble theory, which shows that increasing the number of members reduces variance up to a saturation point, after which additional members contribute diminishing returns due to increasing correlation among bootstrap samples [
73]. In FTS, where directional signals are weak and noisy, excessively large ensembles may oversmooth the signal and reduce sensitivity to rare but profitable deviations. The diversity observed across data sets reflects underlying differences in market microstructure and volatility regimes, where each instrument exhibits distinct patterns of liquidity, noise, and volatility clustering that influence ensemble behavior [
81].
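The saturation effect can be made concrete with the standard variance expression for the average of $M$ identically distributed predictors $f_m(x)$, each with variance $\sigma^2$ and pairwise correlation $\rho$. These symbols are introduced here for illustration and are not taken from the study, and the expression applies to averaging, so it describes voting ensembles only approximately.

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{m=1}^{M} f_m(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{M}\,\sigma^{2}
```

As $M$ grows, the second term vanishes while the first remains. With $\rho = 0.5$, for example, increasing $M$ from 50 to 150 lowers the variance only from $0.51\,\sigma^2$ to about $0.503\,\sigma^2$, so nearly all of the attainable reduction has already been realized at 50 members.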
Table 13 also highlights the contrasting behavior of the entropy and profit loss functions. The entropy loss function consistently yields higher accuracy, as it rewards correct predictions uniformly and therefore encourages the optimization algorithms to maximize classification performance [
84]. In contrast, the profit loss function prioritizes trades with higher financial impact, even if this results in lower overall accuracy. This divergence reflects an established property of financial prediction tasks: accuracy and profitability are not linearly related [
85]. A model may achieve modest accuracy while still capturing a small number of highly profitable directional movements. The overall findings reinforce the importance of optimizing ensemble voting weights, as this enables an ensemble to target profitability more effectively than accuracy-driven configurations.
5.2. Performance of Bootstrap Methods
Table 14 summarizes the performance of the six bootstrap methods across the three ensemble model types. The totals reflect the number of data sets (six in total) for which a bootstrap method achieved the best performance under a given metric. The results show a clear preference for the Tukey method in both the DT and MLP ensembles, where it dominates across accuracy and profit. This behavior is consistent with the overall findings in
Section 5.1, where the Tukey method frequently produced the most stable and profitable ensembles.
The LR ensembles exhibit a more heterogeneous pattern. For accuracy, the subsample method is preferred, while the profit metric shows no single dominant bootstrap method. This divergence reflects the sensitivity of LR ensembles to the structure of the resampled data, where different bootstrap methods preserve different aspects of the underlying temporal dependencies. Overall, the results indicate that the Tukey method is the most robust choice for DT and MLP ensembles, but not necessarily for LR ensembles.
Table 15 examines bootstrap performance from the perspective of the voting mechanism. Each metric contains 36 observations, corresponding to six data sets evaluated under six bootstrap methods. Unlike the model-specific results, no bootstrap method consistently dominates across the voting mechanisms. This suggests that the choice of bootstrap method does not materially influence the behavior of the voting mechanism itself. Instead, the voting mechanism appears to respond primarily to the ensemble’s predictive structure rather than the resampling scheme used to generate its members.
Table 16 evaluates bootstrap performance across ensemble sizes. The Tukey method again performs strongly, achieving the highest number of wins across nearly all ensemble sizes for both accuracy and profit. The stationary bootstrap is the only meaningful competitor, particularly at larger ensemble sizes, which highlights its ability to preserve short-range dependencies. These results indicate that the Tukey and stationary methods are the most effective at generating diverse yet structurally coherent bootstrap samples, which in turn support more stable ensemble performance.
5.3. Performance of Voting Mechanisms
Table 15 summarizes the performance of the voting mechanisms across bootstrap methods. The accuracy results show a slight preference for the equal-weighted majority voting mechanism, although the combined performance of the PSO and QPSO mechanisms indicates that optimized voting weights frequently outperform majority voting. This pattern suggests that while majority voting provides a stable baseline, optimized weighting can yield additional gains when the underlying models benefit from differential contribution strengths.
Table 17 further illustrates this behavior across model types. QPSO is the most effective voting mechanism overall, achieving the highest number of wins for both accuracy and profitability. The DT ensembles are an exception, where majority voting performs best under the accuracy metric. This indicates that DT ensembles benefit from uniform weighting, likely due to their high variance and the stabilizing effect of equal contributions. In contrast, LR and MLP ensembles benefit more from optimized weighting, where QPSO consistently identifies more effective voting configurations.
Table 18 examines performance across ensemble sizes. The PSO mechanism is preferred for most ensemble sizes under both accuracy and profit, indicating that optimized weighting becomes increasingly beneficial as the ensemble grows. Majority voting remains competitive for accuracy, particularly at smaller ensemble sizes, but does not match the profitability achieved by PSO or QPSO. These results suggest that optimization plays a more important role when ensembles become larger and more diverse, where uniform weighting may fail to capture the relative strengths of individual members.
Table 19 evaluates the loss functions used within the PSO and QPSO mechanisms. As expected, the entropy loss function leads to higher accuracy, since the optimization process is designed to maximize the number of correct predictions. Conversely, the profit loss function yields higher profitability, as the search process explicitly targets profitable directional movements. The consistency of these results across both PSO and QPSO indicates that the choice of loss function is the primary determinant of whether the ensemble prioritizes accuracy or profitability.
The broader results in
Table 20,
Table 21 and
Table 22 reinforce this pattern. The entropy loss function aligns strongly with the accuracy metric, while the profit loss function aligns with profitability. Instances where the profit loss function also improves accuracy indicate that the optimization process has identified configurations that simultaneously enhance predictive correctness and financial performance. Such outcomes are particularly desirable because they avoid the trade-off between predictive accuracy and financial performance that the choice of objective function otherwise imposes.
5.4. Performance Impact of Ensemble Sizes
As discussed in
Section 5.1, ensembles with 50 members provide the strongest overall performance across the majority of data sets.
Table 23 offers additional insight by comparing model types across ensemble sizes. The LR ensembles achieve the highest accuracy at all ensemble sizes, while the MLP ensembles rival LR performance only in terms of profitability. The LR ensembles also show marginally stronger profitability at larger ensemble sizes, suggesting that linear decision boundaries benefit from the increased stability associated with larger ensembles [
86].
Table 13 further shows that although 50-member ensembles perform well on average, the preferred ensemble size differs between the accuracy and profitability metrics. Ensemble sizes below 50 do not appear among the accuracy winners, yet they do appear among the profitability winners. This divergence reflects the fact that ensemble size influences different performance metrics in distinct ways. Larger ensembles tend to reduce variance and improve accuracy, while smaller ensembles may preserve directional characteristics that contribute to higher profitability [
87].
The results in
Table 16 and
Table 18 show no strong dependency between ensemble size and the choice of bootstrap method or voting mechanism. However, these tables reinforce the broader patterns observed earlier. The Tukey bootstrap method remains the most effective across ensemble sizes, and PSO-based weighted voting mechanisms consistently outperform majority voting in profitability. These findings align with the general principle that ensemble performance depends not only on the number of members but also on the diversity and weighting of those members [
88].
6. Statistical Evaluation of Ensembles vs. Baseline Models
The statistical evaluation in this section quantifies the performance differences between the developed ensemble models and the baseline models introduced in
Section 4.1 across all six FTS data sets. The aim is to determine whether the observed improvements in accuracy and profit reflect consistent differences rather than random variation. To support this, a set of non-parametric statistical tests is applied to the paired performance results, and these tests report both the direction and the magnitude of the differences. These tests form the basis for the comparative results presented in this section.
6.1. Formulation of Statistical Tests
The statistical evaluation in this study assesses whether the ensemble models offer consistent improvements over the baseline models introduced in
Section 4.1. The baselines include ARIMA, DT, LR, MLP, and LSTM models, which cover a range of approaches commonly applied to FTS classification. The ensemble configurations identified in
Section 5 and listed in
Table 24 are evaluated against these baselines to determine whether the observed differences in accuracy and profit are statistically significant.
A suite of non-parametric statistical tests defined in
Section 3.4.3 is applied to the paired performance differences between each ensemble configuration and its corresponding baseline model across all six data sets. The tests include the Wilcoxon signed-rank test, the sign test, Cliff’s delta, and bootstrap confidence intervals. Each reports information on the direction and magnitude of the differences. The Bayesian signed-rank test is also used to estimate the posterior probability that an ensemble model outperforms its baseline counterpart. This combination of tests provides a structured basis for the comparative analysis.
6.2. Findings for the Accuracy Metric
This section interprets the statistical results for the accuracy metric and explains the factors that contribute to the observed performance differences. The ensemble models incorporate several sources of diversity through bootstrap resampling, varying ensemble sizes, and optimized voting mechanisms. These elements allow the ensembles to capture a broader range of patterns within FTS data than the individual baseline models. The optimized voting mechanisms also allow an ensemble to weight members according to their contribution to predictive performance, which reduces the influence of weaker members and improves overall accuracy. The statistical evaluation therefore indicates whether these design choices lead to consistent performance differences rather than outcomes driven by random variation. A summary of the statistical results for the accuracy metric is provided in
Table 25, and the detailed outputs of the statistical tests are presented in
Table A1 in Appendix A.1.
Across all six data sets, the accuracy-optimal ensemble configuration demonstrates strong and consistent improvements over the baseline models. This is particularly evident in the BCOUSD data set, where the ensemble achieves very low Wilcoxon signed-rank p-values against both ARIMA and DT, accompanied by large effect sizes (Cliff’s $\delta$) and positive confidence intervals. The Bayesian posterior probabilities further support these findings, with the ensemble achieving values as high as 91.7% against multiple baselines. Similar patterns are observed for ETXEUR and USDZAR, where the ensemble again achieves strong evidence of superiority, reflected by Wilcoxon p-values below 0.0040 and effect sizes above 0.88 for several baselines.
In contrast, the few cases where a baseline model shows a higher posterior probability, primarily the LR baseline model and, in isolated instances, the LSTM model, are not supported by strong statistical evidence. For example, in the JPXJPY data set, the LSTM model attains a Bayesian posterior probability of 58.3%, but this result is accompanied by a Wilcoxon p-value of 0.9219, a negligible effect size (Cliff’s $\delta$), and a confidence interval that spans both positive and negative values. Similar patterns are observed in the JPYZAR and XAUUSD data sets, where the LR baseline model occasionally shows a slight advantage, but the confidence intervals cross zero and the effect sizes remain small.
These results indicate that the ensemble models provide reliable and stable improvements in accuracy across diverse financial markets. The statistical evidence consistently favors the ensemble models, while baseline advantages are limited, inconsistent, and not supported by strong statistical indicators.
6.3. Findings for the Profit Metric
This section interprets the statistical results for the profit metric and examines the factors that contribute to the observed performance differences. The ensemble models incorporate several design elements that influence profit-based performance, including bootstrap resampling to introduce diversity, the selection of ensemble sizes that balance variance reduction and model stability, and optimized voting mechanisms that weight ensemble members according to their contribution to profit. These choices enable the ensembles to capture profit-relevant patterns within FTS data that may not be fully exploited by the individual baseline models. A summary of the statistical results for the profit metric is provided in
Table 26, and the detailed outputs of the statistical tests are presented in
Table A2 in Appendix A.2.
Across all six FTS data sets, the ensemble models demonstrate consistent and statistically supported improvements in profit relative to the baseline models. This is particularly evident in the BCOUSD and JPXJPY data sets, where the ensembles achieve very low Wilcoxon signed-rank p-values (), large effect sizes (Cliff’s ranging from 0.58 to 1.0), and strictly positive confidence intervals. The Bayesian posterior probabilities further support these results, with the ensembles achieving values of 91.7% for BCOUSD and JPXJPY. Similar patterns appear in USDZAR, where the ensemble again shows strong evidence of superiority, reflected by Wilcoxon p-values of , effect sizes above 0.90, and positive confidence intervals.
In contrast, the few cases where a baseline model shows a higher posterior probability are not supported by strong statistical evidence. For example, in the JPYZAR data set, the LR baseline model attains a Bayesian posterior probability of 66.7%, but this result is accompanied by a Wilcoxon p-value of 0.3750, a negative effect size (Cliff’s δ), and a confidence interval that spans zero. Similar patterns appear in the ETXEUR and XAUUSD data sets, where the LR baseline model occasionally shows a slight advantage, but the confidence intervals cross zero and the effect sizes remain small.
These results show that the ensemble models achieve clear and consistent gains in profit across the FTS considered. The statistical indicators support these outcomes, while the few apparent baseline advantages lack consistent evidence and do not reflect systematic performance differences.
7. Conclusions and Future Work
This study addressed the need for a comprehensive, end-to-end evaluation of bagging ensemble models for financial time series (FTS) classification. It responded to gaps in the literature related to the interaction of bootstrap methods, ensemble sizes, voting mechanisms, and loss functions. The work examined the full modeling pipeline, from data preprocessing and the construction of a supervised classification problem to the design and evaluation of ensemble configurations across six diverse FTS data sets. The empirical analysis incorporated decision tree (DT), logistic regression (LR), and multi-layer perceptron (MLP) base learners, six time series bootstrap methods, five ensemble sizes, and three voting mechanisms, with additional analysis of the role of entropy- and profit-based loss functions within particle swarm (PSO) and quantum-inspired particle swarm (QPSO) optimization.
The results of this study show that LR-based ensembles provide the strongest overall performance across the six FTS data sets, outperforming the ARIMA, DT, LR, MLP, and LSTM baseline models on both accuracy and profit metrics. The statistical evaluation supports these outcomes, with the ensemble models achieving consistently positive confidence intervals, large effect sizes, and high Bayesian posterior probabilities across most comparisons. Apparent baseline advantages occur only in isolated cases and lack strong statistical support. The choice of the bootstrap method affects performance in model-specific ways. DT and MLP ensembles show their best results under the Tukey bootstrap, while LR ensembles achieve strong performance under the block bootstrap, the sub-sample bootstrap method, and the Tukey bootstrap method. The evaluation also shows that optimized voting mechanisms offer clear advantages over equal-weight majority voting, with the profit-based loss function producing the most consistent gains in this study. The analysis of ensemble size further indicates that FTS classification problems exhibit an optimal range of ensemble members, as larger ensembles do not always yield additional improvements and may reduce performance in certain cases. These findings collectively show that ensemble performance depends on the interaction of bootstrap diversity, ensemble size, and voting strategy, and that careful design choices are necessary to achieve reliable improvements in FTS classification.
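As an illustration of how a time series bootstrap preserves short-range dependence when generating training sets for ensemble members, a moving block bootstrap can be sketched as follows. This is one common block bootstrap variant, shown under the assumption that overlapping fixed-length blocks are sampled with replacement. It is not necessarily the exact variant used in this study, and the function name, block length, and data values are hypothetical.

```python
import random


def moving_block_bootstrap(series, block_length, seed=None):
    """Resample a series by concatenating randomly chosen overlapping blocks."""
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    rng = random.Random(seed)
    n_starts = n - block_length + 1  # number of admissible block start positions
    resample = []
    while len(resample) < n:
        start = rng.randrange(n_starts)
        resample.extend(series[start:start + block_length])
    return resample[:n]  # trim the last block so the length matches the input


# Hypothetical daily returns (illustrative values only).
returns = [0.4, -0.2, 0.1, 0.3, -0.5, 0.2, 0.0, -0.1, 0.6, -0.3, 0.2, 0.1]

# Each ensemble member would be trained on a different resample.
for member in range(3):
    print(moving_block_bootstrap(returns, block_length=4, seed=member))
```

Observations within each block keep their original order, so local temporal structure is retained while the resamples still differ across ensemble members.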
Practical Implications. The findings of this study offer several practical insights for FTS practitioners. First, LR-based ensembles provide a reliable and interpretable foundation for directional classification tasks across diverse market conditions. Second, the selection of bootstrap methods should reflect the characteristics of the base learner. Tukey provides strong and consistent performance for DT and MLP ensembles, while LR ensembles achieve their best results under the block bootstrap, the sub-sample bootstrap, and the Tukey bootstrap method. As a result, no single resampling strategy dominates across all learners. Third, optimized voting mechanisms offer clear advantages over equal-weight majority voting, and profit-oriented loss functions provide the most consistent improvements in this study. Finally, the identification of optimal ensemble sizes highlights the importance of balancing diversity with computational efficiency, especially in real-time or resource-constrained environments.
Limitations. Several limitations should be acknowledged. The analysis is restricted to six FTS data sets, which, although diverse, do not capture the full range of market regimes or structural characteristics. The study also focuses on three base learners and a specific family of optimization algorithms, leaving open the question of how alternative model classes or optimization strategies might behave under similar ensemble designs. In addition, the profit metric used in this study does not incorporate transaction costs or market frictions, which may influence real-world applicability. These limitations provide opportunities for further investigation.
Future Work. Future research may extend this study in several directions. One avenue is the incorporation of kernel-based methods to explore nonlinear extensions of LR ensembles and to assess their effect on model bias and predictive stability. Another direction involves variation in the number of iterations in the self-adaptive PSO and QPSO algorithms to better understand their convergence behavior under different loss functions. A multi-objective optimization approach that combines entropy and profit loss functions may also produce more balanced ensemble designs. Further work may examine alternative sampling distributions for the quantum cloud used in QPSO and expand the empirical evaluation to include additional FTS data sets with different structural characteristics. These directions also address several limitations of the current study, which include the use of a finite set of FTS data sets, a focus on three base learners, and the exclusion of transaction costs and market frictions from the profit metric. By broadening the empirical scope and exploring additional modeling components, future research would deepen the understanding of ensemble behavior in FTS classification and strengthen the practical relevance of the results.