Ensemble Approach for Financial Time Series Modeling

Nannoolal, Aveer; Engelbrecht, Andries P.

doi:10.3390/a19050404

by Aveer Nannoolal¹,* and Andries P. Engelbrecht¹,²

¹ Department of Industrial Engineering, Stellenbosch University, Stellenbosch 7602, South Africa

² Division of Computer Science, Stellenbosch University, Stellenbosch 7602, South Africa

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(5), 404; https://doi.org/10.3390/a19050404

Submission received: 9 March 2026 / Revised: 11 April 2026 / Accepted: 20 April 2026 / Published: 18 May 2026

Abstract

This study provides a comprehensive evaluation of bagging ensemble models for financial time series (FTS) classification and addresses a gap in the literature regarding how bootstrap methods, ensemble sizes, voting mechanisms, and loss functions jointly influence model performance. The analysis evaluates decision tree (DT), logistic regression (LR), and multi-layer perceptron (MLP) ensemble models modified by six time series bootstrap methods, five ensemble sizes, and three voting mechanisms across six FTS data sets. The study also examines the influence of entropy- and profit-based loss functions within particle swarm (PSO) and quantum-inspired particle swarm (QPSO) optimization for weighted voting. The results show that LR-based ensembles provide the strongest overall performance and outperform ARIMA, DT, LR, MLP, and LSTM baseline models on both accuracy and profit metrics. Bootstrap effects are model specific. DT and MLP ensembles perform best under the Tukey bootstrap, while LR ensembles achieve strong results under the block bootstrap, the sub-sample bootstrap method, and the Tukey method, and remain the strongest performers across all bootstrap configurations. Optimized voting mechanisms yield clear improvements over equal-weight majority voting, with the profit loss function producing the most consistent gains. The findings also indicate that FTS classification problems exhibit an optimal range of ensemble sizes, as larger ensembles do not always improve performance. The study contributes a systematic assessment of ensemble design choices for FTS classification and highlights the importance of jointly considering bootstrap diversity, ensemble size, and voting strategy when developing ensemble models for financial applications.

Keywords: financial time series classification; bagging ensembles; bootstrap methods; weighted voting; particle swarm optimization; quantum-inspired optimization

1. Introduction

Numerous analysis and modeling techniques have been developed for time series data [1]. The earliest approaches include statistical methods such as the Box–Jenkins method, developed by George Box and Gwilym Jenkins in the 1970s [2], where analyses were performed on relatively small data sets. Since then, advances in computing technologies, data storage, and electronic data acquisition have enabled the collection and processing of large-scale time series data. Modern approaches to time series modeling include a wide range of machine learning and deep learning methods [3], with examples such as Gramian angular fields (GAF) and Markov transition fields (MTF) [4], heuristic methods [5], and ensemble learning [6].

A financial time series (FTS) is a time series that records the observed value of a financial asset over time. The analysis of financial markets plays a significant role in economic decision-making and influences the behavior of individuals and institutions. Due to the importance of FTS, numerous modeling techniques have been developed to predict movements within financial markets [7]. However, the irregular, noisy, and non-stationary characteristics of FTS make modeling a complex and error-prone task [7].

A key challenge in FTS modeling is that no single modeling approach consistently outperforms others across different data sets, as suggested by the no free lunch theorem (NFLT) [8]. Each modeling approach is problem-specific, and performance varies depending on the characteristics of the underlying data. Ensemble learning seeks to address this challenge by improving the generalization capabilities of individual models through the combination of multiple learners [9]. Ensemble learning is considered a meta-approach to machine learning [10], where the focus is on the process of learning rather than the performance of a single model [11]. Ensemble models can be constructed using a variety of machine learning algorithms and are typically developed using bagging, boosting, or stacking strategies [10].

Existing research on ensemble modeling for FTS has primarily focused on isolated components of the ensemble design process. These components include improvements in base models [12,13,14,15], feature engineering [16,17,18], and architectural variations [19,20,21]. However, the literature does not provide a comprehensive investigation into the end-to-end design of bagging ensemble models for FTS classification problems [22]. Current studies seldom examine how multiple ensemble components, which include bootstrap methods, ensemble sizes, voting mechanisms, and loss functions, interact to influence predictive performance. Furthermore, prior work rarely evaluates ensemble design choices across a diverse set of baseline models that span econometric, machine learning, and deep learning domains.

This study aims to address these gaps by treating ensemble design as a multi-dimensional modeling problem for FTS classification tasks. The contributions of this study include: (i) an end-to-end evaluation framework for bagging ensemble models applied to FTS classification tasks; (ii) the introduction of a sub-sampling bootstrap method tailored for FTS and its comparison with established time series bootstrap techniques; (iii) the development of an optimized ensemble weighting mechanism using particle swarm optimization (PSO) and quantum-inspired particle swarm optimization (QPSO) algorithms; and (iv) a comparative analysis across baseline models from econometrics (autoregressive integrated moving average (ARIMA)), machine learning (decision tree (DT), logistic regression (LR), multi-layer perceptron (MLP)), and deep learning (long short-term memory network (LSTM)), thereby providing a cross-domain evaluation of ensemble performance. The study also includes a generalizable ensemble framework that can be applied to any classifier and remains interpretable.

Bagging ensemble models are developed and evaluated on six FTS data sets, which are transformed into a classification problem where the objective is to predict the direction of movement mapped to a buy, sell, or do-nothing action. This formulation aligns with practical financial decision-making, where directional movements correspond directly to actionable trading outcomes [23]. Although transformer-based and generalized autoregressive conditional heteroskedasticity (GARCH)-type models have shown promise in recent FTS research, they are not included in this study due to the focus on classification-oriented ensemble design rather than sequence-to-sequence forecasting or volatility modeling [24,25]. The main objectives of this study are to analyze the performance of time series bootstrap methods, evaluate optimized voting mechanisms, assess the impact of ensemble sizes, and examine the influence of loss functions used in metaheuristic optimization. A preliminary objective is to perform a detailed data analysis of the six FTS data sets to inform the development of an FTS classification problem. The insights obtained from this analysis guide the ensemble model development process.

The remainder of this paper is structured as follows: a review of related works, a description of the empirical process, the design and development of the ensemble models, a presentation of ensemble modeling results, and a statistical evaluation of the best ensemble configuration against the baseline models. The paper concludes with final remarks and proposed directions for future work.

2. Related Works

This section reviews the literature relevant to FTS modeling and analysis. It presents foundational concepts, which include formal definitions of time series and FTS, as well as key analytical properties such as stationarity, entropy, and commonly used financial indicators. The review also outlines methodological components that appear frequently in prior work, which include ensemble methods, bootstrap procedures for dependent data, and metaheuristic optimization strategies. These elements establish the conceptual and methodological background for a comparative analysis that concludes this section.

2.1. Time Series

A time series is a sequence $X$ of observations $x_{1}, \dots, x_{t}$ recorded over a period of time. The observations may occur across a continuous interval, at regular sample intervals, or at fixed time points. A time series $X$ is defined as

X = (x_{1}, x_{2}, \dots, x_{t})

where each $x_{i} \in \mathbb{R}$ and $i \in T$, with $T \subset \mathbb{N}$ representing the index set. An FTS follows the same structure, with each $x_{i}$ interpreted as a financial observation within the sequence.

2.2. Stationarity

Stationarity is a property of a time series that reflects whether its statistical characteristics, such as the mean, variance, and covariance, remain constant over time. A time series is stationary when the process that generates the observations does not depend on time [26]. In such cases, no trends or seasonality appear in the data. A non-stationary series displays time-dependent behavior, with changes in its statistical properties and visible features such as trends or seasonality. An analysis of a non-stationary series must account for this time dependence [26]. Stationarity reduces the complexity of predictive model development, and Table 1 outlines the main forms of stationarity. An FTS benefits from stationarity in the same way, as stable statistical properties simplify model construction. Two approaches assist in determining whether a series arises from a stationary process, namely visual inspection and statistical tests. Visual inspection provides a judgment-based assessment of the data. The statistical tests used in this study are the augmented Dickey–Fuller (ADF) test [27] and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test [28]. Both tests rely on the presence or absence of a unit root, a feature of a stochastic process [28], and are often applied together to confirm the stationarity properties of a time series.

2.3. Entropy

Entropy is a foundational concept in information theory, introduced by Shannon (1948) [29], and serves as a measure of uncertainty in random variables. In this context, a variable represents a unit of information storage, expressed in bits. Within time series and FTS analysis, entropy provides insight into the level of volatility in a series by distinguishing between common and rare events reflected in the data. The probability of an event determines the amount of information required to represent that event. To apply Shannon’s entropy to a time series $X$, the probability space is defined by the distribution $p(x)$ over the values of $X$, where $x_{i} \in X$ and $i$ denotes the position of the observation in the sequence. Entropy for a continuous-valued time series is defined as

H (X) = - \int_{- \infty}^{\infty} p (x) \ln p (x) \, d x .

(1)

Additional information measures extend this concept. Conditional entropy (CE) [30] quantifies the amount of information required to describe a random variable $Y$ when the value of a related variable $X$ is known. The entropy of $Y$ conditioned on $X$ is written as $H(Y | X)$ and is defined as

H (Y | X) = - \sum_{x_{i}, y_{i}} p (x_{i}, y_{i}) \log_{2} \frac{p (x_{i}, y_{i})}{p (x_{i})} .

(2)

Mutual information (MI) [31] measures the degree of dependence between two random variables. For two time series $X$ and $Y$, MI is written as $I(X, Y)$ and is defined as

I (X, Y) = \sum_{x_{i}, y_{i}} p (x_{i}, y_{i}) \log_{2} \frac{p (x_{i}, y_{i})}{p (x_{i}) p (y_{i})} .

(3)

A special case arises when $X = Y$, in which case $I(X, X)$ equals $H(X)$, confirming that the information in a variable is fully dependent on itself. These entropy-based measures assist in the analysis of time series and FTS applications, particularly in feature selection tasks [32] where multiple measurements originate from the same series.
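The measures in Equations (1) to (3) can be illustrated with a minimal sketch. It assumes that the series has already been discretized into symbols (for example, binned price changes), so the integral in Equation (1) is replaced by a sum and a base-2 logarithm is used throughout. The function names and the toy series are hypothetical and are not taken from the study.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy H(X) in bits for a sequence of discrete symbols."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(ys, xs):
    """H(Y | X) = -sum p(x, y) log2(p(x, y) / p(x))."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    return -sum(
        (c / n) * math.log2((c / n) / (px[x] / n))
        for (x, _), c in joint.items()
    )

def mutual_information(xs, ys):
    """I(X, Y) = sum p(x, y) log2(p(x, y) / (p(x) p(y)))."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Hypothetical discretized series (assumption: values are bin indices).
x = [0, 0, 1, 1, 2, 2, 0, 1]
y = [0, 1, 1, 1, 2, 0, 0, 1]

h_x = entropy(x)
mi_xx = mutual_information(x, x)   # special case X = Y
mi_xy = mutual_information(x, y)
```

In this sketch `mi_xx` matches `h_x` up to floating-point error, which mirrors the special case $I(X, X) = H(X)$ described above.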

2.4. Financial Time Series Indicators

Financial indicators are statistics that assist in the assessment of the soundness, stability, and performance of a financial asset. These indicators also provide insight into economic activity across different sectors. The analysis and development of FTS indicators form part of the technical analysis of an asset. Two categories of indicators are relevant to this study, namely leading indicators [33] and lagging indicators [34]. A leading indicator is a variable that signals future change or movement in another FTS, process, trend, or event before the change occurs. A lagging indicator is a variable that correlates with an FTS and confirms the presence of a trend. Table 2 lists several leading and lagging indicators together with their classification and definitions. The indicators of interest in this study are the average directional index (ADX), relative strength index (RSI), and simple moving average (SMA). These indicators provide direct information about trend strength, momentum, and directional movement, which aligns with the objective of classifying short-term behavior in FTS.

2.5. Ensemble Modeling

An ensemble model combines the predictions of multiple base models or learners. A level of diversification is achieved in an ensemble design, which allows the model to address patterns in complex data sets such as time series data [22]. Three main approaches exist in ensemble model construction, namely bagging, boosting, and stacking. Each approach applies a distinct method that enables members of an ensemble to capture different patterns in separate regions of a data set. A bagging ensemble relies on a statistical procedure known as bootstrapping, which varies samples of the data set during the training phase [35]. The trained members of an ensemble contribute to a final prediction through a voting mechanism. Bagging therefore represents a data-driven approach, and diversification arises from variation in the training data. In contrast, boosting represents a model-driven approach that aims to correct errors from prior models by altering the focus of the learning algorithm [35]. Models are added sequentially, with each new model aiming to correct the errors of the previous one. The data set remains unchanged, but the algorithm assigns weights that direct attention toward correct or incorrect predictions [10]. This weighting mechanism ensures a diverse set of learners that collectively outperform a single base model. Stacking is an ensemble method that combines multiple models to produce an output that exceeds the performance of any individual model [36]. A stacked ensemble uses several models in an initial layer and then uses their outputs as inputs to a final model. The models in the initial layer operate independently, and the training data remains unchanged. This study focuses on the bagging approach. The evaluation includes six bootstrap methods for time series data, metaheuristic procedures for the optimization of voting mechanisms, and an assessment of the effect of ensemble size on classification performance.

2.6. Bootstrap Methods for Time Series

The bootstrap method introduced by Efron (1979) [37] is a statistical procedure that estimates properties of a population by resampling from an observed data set. A bootstrap sample arises by drawing observations one at a time from the data and returning each selected value to the sample space. Resampling may occur with or without replacement, although the classical bootstrap draws with replacement. Time series data introduce additional complexity due to the dependence between observations. As a result, any resampling procedure must preserve the dependent structure of a series [38]. This study evaluates six bootstrap methods for time series data that support the development of ensemble models. Five are established methods, namely the block, moving-block, sieve, stationary, and Tukey bootstrap methods. The sixth is a sub-sampling method developed as a competing method.

The block bootstrap, illustrated in Figure 1a, resamples non-overlapping blocks of data rather than individual observations [39]. The time series is divided into blocks to preserve its dependent structure, and a bootstrap sample arises from a random selection of these blocks. The block size is a key parameter. A small block size may fail to preserve dependence, whereas a large block size may reduce variability in the resampled data [40,41].

The moving-block bootstrap (MBB) [42], shown in Figure 1b, follows the same principle but uses overlapping blocks of fixed length. The block size determines the number of data points in each block, and a bootstrap sample arises from a random selection of these overlapping blocks. This approach aims to preserve the dependent structure of the original series.
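The difference between the block bootstrap and the MBB can be sketched as follows. This is a minimal illustration under stated assumptions: blocks are drawn uniformly with replacement, any trailing observations that do not fill a complete non-overlapping block are dropped, and the function names and block size are hypothetical.

```python
import random

def block_bootstrap(series, block_size, rng):
    """Resample non-overlapping blocks with replacement."""
    blocks = [
        series[i:i + block_size]
        for i in range(0, len(series) - block_size + 1, block_size)
    ]
    sample = []
    for _ in range(len(blocks)):
        sample.extend(rng.choice(blocks))
    return sample

def moving_block_bootstrap(series, block_size, rng):
    """Resample overlapping blocks: every start index is a candidate."""
    n = len(series)
    starts = range(n - block_size + 1)
    sample = []
    while len(sample) < n:
        s = rng.choice(starts)
        sample.extend(series[s:s + block_size])
    return sample[:n]  # trim to the original length

rng = random.Random(0)
series = list(range(12))          # toy series; values equal their index
bb = block_bootstrap(series, 3, rng)
mbb = moving_block_bootstrap(series, 3, rng)
```

Within each block the original ordering is retained, which is how both methods preserve short-range dependence; only the boundaries between blocks break it.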

The sieve bootstrap introduced by Bühlmann (1997) [43] fits an autoregressive model to a time series and resamples the residuals of the fitted model. This method preserves the dependent structure of the series by resampling residuals from an autoregressive process of order $p(n)$. The order increases with the sample size $n$ under the conditions $p(n) \to \infty$ and $p(n) = o(n)$ as $n \to \infty$, where $p(n)$ defines the order of the autoregressive model and $o(n)$ indicates that the order grows more slowly than the number of samples [43]. A bootstrap sample then arises from the autoregressive model in a manner similar to the MBB.
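The residual resampling idea behind the sieve bootstrap can be sketched with a fixed first-order autoregressive model. This is a simplification: the method described above lets the order $p(n)$ grow with the sample size, whereas the sketch assumes $p = 1$, a least-squares fit, and i.i.d. resampling of centered residuals. All names and the synthetic data are hypothetical.

```python
import random

def fit_ar1(series):
    """Least-squares fit of (x_t - mean) = phi * (x_{t-1} - mean) + e_t."""
    mean = sum(series) / len(series)
    z = [v - mean for v in series]
    num = sum(z[t] * z[t - 1] for t in range(1, len(z)))
    den = sum(z[t - 1] ** 2 for t in range(1, len(z)))
    phi = num / den
    resid = [z[t] - phi * z[t - 1] for t in range(1, len(z))]
    r_mean = sum(resid) / len(resid)
    resid = [e - r_mean for e in resid]  # center the residuals
    return mean, phi, resid

def sieve_bootstrap(series, rng):
    """Generate one bootstrap series from the fitted AR(1) model."""
    mean, phi, resid = fit_ar1(series)
    z = series[0] - mean              # start from the first observation
    out = [series[0]]
    for _ in range(len(series) - 1):
        z = phi * z + rng.choice(resid)  # residual drawn with replacement
        out.append(mean + z)
    return out

# Synthetic AR(1) data with coefficient 0.7 (assumption for illustration).
rng = random.Random(1)
data = [0.0]
for _ in range(499):
    data.append(0.7 * data[-1] + rng.gauss(0.0, 1.0))

mean, phi, resid = fit_ar1(data)
boot = sieve_bootstrap(data, rng)
```

A fuller implementation would select the order from the data, for example with an information criterion, before resampling.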

The stationary bootstrap [41] draws blocks of random length, where the starting index follows a uniform distribution and the block length follows a geometric distribution. This method is suitable for time series that require preservation of dependence but do not satisfy strict stationarity. This characteristic is helpful with modeling real-world data sets. However, it is not recommended for data sets with strong trends.
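A minimal sketch of the stationary bootstrap is given below. It assumes the common formulation in which a block ends with probability $1 / \text{mean\_block}$ at each step, which yields geometrically distributed block lengths, and in which indices wrap around the end of the series. The parameter name `mean_block` is hypothetical.

```python
import random

def stationary_bootstrap(series, mean_block, rng):
    """Blocks start at uniform positions and have geometric lengths."""
    n = len(series)
    p = 1.0 / mean_block             # probability of starting a new block
    sample = []
    idx = rng.randrange(n)           # uniform starting index
    while len(sample) < n:
        sample.append(series[idx])
        if rng.random() < p:
            idx = rng.randrange(n)   # start a new block
        else:
            idx = (idx + 1) % n      # continue the block, wrapping around
    return sample

rng = random.Random(0)
series = list(range(50))
boot = stationary_bootstrap(series, mean_block=5, rng=rng)
```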

A sub-sampling method developed by the authors serves as a competing approach. The method uses a block size equal to the data length $n$ divided by the number of required samples $m$. A random process selects overlapping blocks of fixed length, and each bootstrap sample contains $n / m$ observations.
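One possible reading of the sub-sampling description is sketched below. It assumes that each bootstrap sample is a single contiguous window of length $n / m$ (rounded down), that start positions are drawn uniformly, and that windows from different draws may overlap. These details and the function name are assumptions, since the text does not specify them.

```python
import random

def sub_sample(series, m, rng):
    """Draw m contiguous windows, each of length len(series) // m."""
    n = len(series)
    block = n // m
    samples = []
    for _ in range(m):
        start = rng.randrange(n - block + 1)   # uniform start index
        samples.append(series[start:start + block])
    return samples

rng = random.Random(0)
series = list(range(100))
samples = sub_sample(series, m=5, rng=rng)
```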

The Tukey bootstrap [44] is a variant of the moving-block bootstrap that applies tapering at the edges of each block to ensure continuity in the bootstrap sample path. The tapering acts as a window function that smooths the boundaries between adjacent blocks, which reduces discontinuities when blocks are concatenated. This method supports variance estimation, with a specific focus on the estimation of sample means.

2.7. Metaheuristics

Metaheuristics are high-level, problem-independent strategies that guide a search procedure toward a global best solution, without any guarantee of success for a specific optimization problem [45]. A metaheuristic updates a candidate solution through a sequence of steps until a termination criterion indicates that no further improvement is possible. Numerous metaheuristic families exist, including nature-inspired procedures such as particle swarm optimization [46] and metallurgy-inspired procedures such as simulated annealing [47]. Rezk and Selim (2024) [18] provide an overview of metaheuristic procedures used in ensemble model construction, with an emphasis on methods that adjust ensemble member contributions. According to Rezk and Selim (2024) [18], swarm-based algorithms are the second most common metaheuristic class after evolutionary algorithms, and particle swarm optimization is the second most preferred algorithm after the genetic algorithm [48]. Rezk and Selim (2024) [18] also note that the selection of a metaheuristic depends on problem-specific criteria such as performance, diversity, complexity, and efficiency. Time series data sets arise from dynamic environments with varying levels of complexity; therefore, two metaheuristic procedures are used in this study, namely particle swarm optimization and quantum-inspired particle swarm optimization. Both procedures are used to adjust ensemble member contributions within a voting mechanism.

The particle swarm optimization (PSO) algorithm is a nature-inspired procedure based on the social behavior of birds. PSO uses a stochastic search process to explore a population of candidate solutions and identify an optimal solution that satisfies predefined criteria [46]. Each candidate solution is a particle and the number of particles is user defined. Each particle is represented by four vectors: the position vector $x_{i}(t)$, the personal best vector $p_{i}(t)$, the velocity vector $v_{i}(t)$, and the global best vector $p_{g}(t)$, which records the best position found by any particle in the swarm. The procedure begins with an initialization of positions and velocities. Velocities are set to zero, each $p_{i}$ is initialized to $x_{i}(0)$, and $p_{g}$ is initialized to the best among the $p_{i}(0)$ values. The velocity update rule with inertia, introduced by Shi and Eberhart (1998) [49], is

v_{i} (t + 1) = ω v_{i} (t) + ρ_{1 i} (t) c_{1} [p_{i} (t) - x_{i} (t)] + ρ_{2 i} (t) c_{2} [p_{g} (t) - x_{i} (t)] ,

(4)

where $ω$ is the inertia weight, $ρ_{1 i}(t)$ and $ρ_{2 i}(t)$ are vectors of random values in $[0, 1]^{n}$, and $c_{1}$ and $c_{2}$ are acceleration coefficients that control cognitive and social influence. The position update rule is

x_{i} (t + 1) = x_{i} (t) + v_{i} (t + 1) .

(5)

After each update, the vectors $p_{i}$ and $p_{g}$ are evaluated to determine whether new personal or global best positions have been reached. A termination condition signals the end of the procedure once no significant improvement is possible.
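The update rules in Equations (4) and (5) can be sketched as a short minimization routine. The sketch makes several assumptions that are not taken from the study: fixed control parameters ($ω = 0.729844$, $c_{1} = c_{2} = 1.496180$), a fixed iteration budget as the termination condition, an immediate update of the global best within each iteration, and a sphere function as a stand-in objective.

```python
import random

def pso(f, dim, n_particles=20, iters=200, w=0.729844,
        c1=1.496180, c2=1.496180, bounds=(-5.0, 5.0), seed=0):
    """Minimize f with inertia-weight PSO (Equations (4) and (5))."""
    rng = random.Random(seed)
    lo, hi = bounds
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]   # velocities start at zero
    p = [xi[:] for xi in x]                         # personal bests = x_i(0)
    p_val = [f(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: p_val[i])
    p_g, g_val = p[g][:], p_val[g]                  # best of the personal bests
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()  # fresh draw per dimension
                v[i][j] = (w * v[i][j]
                           + r1 * c1 * (p[i][j] - x[i][j])
                           + r2 * c2 * (p_g[j] - x[i][j]))
                x[i][j] += v[i][j]
            val = f(x[i])
            if val < p_val[i]:                       # new personal best
                p[i], p_val[i] = x[i][:], val
                if val < g_val:                      # new global best
                    p_g, g_val = x[i][:], val
    return p_g, g_val

def sphere(z):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(c * c for c in z)

best, best_val = pso(sphere, dim=3)
```

In the ensemble setting described later, the position vector would hold the member weights and the objective would be a loss on the ensemble prediction rather than a test function.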

The quantum-inspired particle swarm optimization (QPSO) algorithm extends PSO by incorporating principles from quantum mechanics [50]. QPSO addresses environments where the best solution may shift over time. Two additional control parameters appear in QPSO: $s$, the number of quantum particles, and $r_{\text{cloud}}$, the radius of a quantum cloud around the global best position. Quantum particles are sampled from a probability distribution centered at $p_{g}(t)$, and the update rule is

x_{i} (t + 1) \sim d (p_{g} (t), r_{\text{cloud}}),

(6)

where $d$ denotes a probability distribution and $r_{\text{cloud}}$ defines the quantum radius. The remaining, non-quantum particles follow the PSO update rules in Equations (4) and (5).
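The quantum update in Equation (6) can be sketched for one choice of distribution. The sketch assumes that $d$ is uniform over a ball of radius $r_{\text{cloud}}$ centered at the global best position, which is only one of several possible distributions, and the function name is hypothetical.

```python
import math
import random

def sample_in_cloud(p_g, r_cloud, rng):
    """Draw a position uniformly from the ball of radius r_cloud around p_g."""
    dim = len(p_g)
    # A normalized Gaussian vector gives a uniformly distributed direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in direction))
    # Scaling by u**(1/dim) makes the point uniform in volume.
    radius = r_cloud * rng.random() ** (1.0 / dim)
    return [g + radius * c / norm for g, c in zip(p_g, direction)]

rng = random.Random(0)
p_g = [0.5, -1.0, 2.0]
positions = [sample_in_cloud(p_g, 0.1, rng) for _ in range(100)]
```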

Both PSO and QPSO require suitable values for $ω$, $c_{1}$, and $c_{2}$. Harrison et al. (2017) [51] propose a self-adaptive procedure based on the stability condition of Poli and Broomhead (2007) [52],

\frac{22 - 30 ω^{2}}{7 - 5 ω} < c_{1} + c_{2} < \frac{24 - 24 ω^{2}}{7 - 5 ω} .

(7)

New values for $ω$, $c_{1}$, and $c_{2}$ are selected after every $k$ iterations, with $k = 5$ recommended by Harrison et al. (2017) [51]. The additional QPSO parameters $s$, $d$, and $r_{\text{cloud}}$ remain user defined. Blackwell and Bentley (2002) [50] suggest a uniform distribution for $d$, although Harrison et al. (2015) [53] show that a uniform distribution may perform poorly in certain dynamic environments. A uniform distribution is used in this study to maintain a simple and consistent baseline. Harrison et al. (2015) [53] also note that smaller values of $r_{\text{cloud}}$ are preferable in environments with mild changes.
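The condition in Equation (7) can be checked directly, and a simple way to draw admissible control parameters is rejection sampling. The sketch below is an illustration only: the sampling ranges for $ω$, $c_{1}$, and $c_{2}$ and the rejection scheme are assumptions and do not reproduce the self-adaptive procedure of Harrison et al. (2017) [51].

```python
import random

def stability_bounds(w):
    """Lower and upper bounds on c1 + c2 from Equation (7)."""
    lower = (22.0 - 30.0 * w * w) / (7.0 - 5.0 * w)
    upper = (24.0 - 24.0 * w * w) / (7.0 - 5.0 * w)
    return lower, upper

def satisfies_condition(w, c1, c2):
    """True when (w, c1, c2) lies strictly inside the band of Equation (7)."""
    lower, upper = stability_bounds(w)
    return lower < c1 + c2 < upper

def sample_parameters(rng, max_tries=10000):
    """Rejection-sample parameters; the ranges are illustrative assumptions."""
    for _ in range(max_tries):
        w = rng.uniform(-1.0, 1.0)
        c1 = rng.uniform(0.0, 4.0)
        c2 = rng.uniform(0.0, 4.0)
        if satisfies_condition(w, c1, c2):
            return w, c1, c2
    raise RuntimeError("no admissible parameters found")

rng = random.Random(0)
w, c1, c2 = sample_parameters(rng)
```

In a self-adaptive run, a call such as `sample_parameters` would be repeated every $k$ iterations to refresh the control parameters.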

As part of the ensemble development process in this study, self-adaptive versions of PSO and QPSO adjust the contribution of each ensemble member in the voting mechanism. Each member receives a weight that determines its influence on the final prediction.
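A weighted voting mechanism of this kind can be sketched as follows. The sketch assumes that each member outputs a single class label, that the class with the largest total weight wins, and that ties resolve to the first class in the list. The function name and the example weights are hypothetical; in the study the weights would come from PSO or QPSO.

```python
def weighted_vote(predictions, weights, classes=("buy", "sell", "do-nothing")):
    """Return the class with the largest sum of member weights."""
    scores = {c: 0.0 for c in classes}
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # max() keeps the first class on ties, following the order of `classes`.
    return max(classes, key=lambda c: scores[c])

member_predictions = ["buy", "sell", "sell"]

# Equal weights reproduce majority voting.
majority = weighted_vote(member_predictions, [1.0, 1.0, 1.0])

# A heavily weighted member can override the majority.
weighted = weighted_vote(member_predictions, [0.6, 0.2, 0.1])
```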

2.8. Comparison of Financial Time Series Techniques

Financial time series modeling spans a broad methodological landscape across econometric, machine learning (ML), deep learning (DL), and ensemble learning domains. Each domain introduces distinct assumptions, data requirements, and performance characteristics, which results in a diverse and often fragmented body of research. To establish a clear context for the methodological choices adopted in this study, Table 3 presents a comparative summary of representative contributions across these domains. The table outlines the modeling techniques used, the type of data analyzed, the primary task addressed, and the main outcomes reported. This structured overview provides a foundation for understanding the range of approaches applied to forecasting and classification tasks in FTS settings and clarifies the methodological gaps that motivate the unified evaluation conducted in the present study.

The studies in Table 3 reveal several consistent patterns across the FTS literature. Econometric models such as ARIMA and GARCH provide essential baselines for linear dynamics and volatility structure, although these models offer limited capacity for nonlinear behavior. ML and DL methods introduce greater flexibility, with evidence that classifier performance varies across data sets and that architectures such as LSTMs can exceed the performance of traditional baselines. Ensemble methods appear across all domains, which reflects a broad consensus that a combination of learners improves robustness and predictive accuracy. However, existing ensemble learning studies typically address isolated components such as pruning, weighting, or bootstrap diversity, rather than evaluating a complete ensemble design pipeline. As a result, prior research provides valuable but incomplete insights. The present study addresses this gap by conducting an end-to-end comparison of ML, DL, and econometric baselines within a unified ensemble modeling framework for FTS classification.

3. Empirical Process

The empirical process defines the sequence of steps required to transform raw financial time series into structured, statistically assessed, and consistently labeled data sets for subsequent model development. This process includes data acquisition, data quality assessment, data corrections, segmentation into intervals, stationarity analysis, entropy analysis, and class label construction. These steps establish a reproducible foundation for the evaluation of ensemble models. This section presents the workflow structure, the data sets used, the preprocessing procedures applied, and the metrics selected for performance evaluation.

3.1. Empirical Workflow Overview

Figure 2 presents a structured workflow that defines the complete empirical process used in this study. The workflow begins with raw FTS data and proceeds through a sequence of data preparation and analysis steps. These steps include a data quality assessment, the application of data corrections, the segmentation of each data set into intervals, and statistical assessments based on stationarity and entropy. A labeling procedure follows, together with an evaluation of the resulting class distributions.

The second stage of the workflow focuses on ensemble model development. Bootstrap sample generation provides the diversity required for bagging ensemble construction. Each ensemble is formed from identical base learners trained on distinct bootstrap samples. A metaheuristic optimization procedure then assigns a weight to each ensemble member to determine its contribution to the final prediction. Baseline models are trained on the same input data to provide reference performance.

The final stage of the workflow evaluates the predictive performance of the ensemble models. Performance metrics quantify accuracy and profit, and a comparative analysis contrasts ensemble performance with the performance of baseline models. This workflow establishes a consistent and reproducible process for all data sets and ensemble configurations used in this study.

3.2. Datasets

Six financial time series data sets form the basis of the empirical analysis. Each data set was obtained from the HistData repository, which supplies one-minute resolution price data for multiple asset classes. HistData.com provides historical market data for research and educational use; the platform specifies that the data are offered without warranty and may contain gaps or irregularities due to market conditions or data collection constraints. The selection includes two commodity series, two stock index series, and two exchange rate series to ensure diversity across market types and volatility regimes.

The data sets consist of Brent Crude Oil in United States Dollars (BCOUSD), Gold in United States Dollars (XAUUSD), the EURO STOXX 50 index in euros (ETXEUR), the Nikkei 225 index in Japanese Yen (JPXJPY), the United States Dollar to South African Rand exchange rate (USDZAR), and the Japanese Yen to South African Rand exchange rate (JPYZAR). Each data set contains five features, namely open, close, high, and low prices as well as the reported trading volume, all recorded at one-minute intervals. Table 4 summarizes the date ranges, total observations, number of one-day samples, and asset class categories for each series.

3.3. Data Preprocessing

The data preprocessing stage establishes the structural and statistical integrity required for all subsequent empirical analysis. This stage includes a data quality assessment and the application of corrective procedures, followed by stationarity and entropy evaluations, and the construction of class labels. Each component contributes to a consistent and reproducible preparation of FTS data sets.

3.3.1. Data Quality Assessments and Corrections

The data quality assessment identified three structural issues across all data sets. First, missing values appeared at irregular one-minute intervals for multiple features. Second, price observations were recorded during weekends despite the absence of active trading. Third, the volume feature contained no usable information, with extended sequences of zero values across all series.

Corrective procedures were applied to address these issues. Missing values were corrected through linear interpolation between the nearest valid observations to preserve the one-minute sampling structure [56]. Weekend observations were removed to align each series with standard market trading days [57]. The volume feature was excluded from further analysis due to the absence of reliable information. Each data set was then segmented into one-day intervals covering the period 09:00 to 17:00 to reduce computational requirements and to focus the analysis on periods with higher price variability [57].
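The corrective steps above can be sketched in a few lines. The sketch assumes that missing values are represented as `None`, that gaps at the very start or end of a series are left unfilled, and that the 09:00 to 17:00 window includes both endpoints. The function names and sample rows are hypothetical.

```python
from datetime import datetime

def interpolate_missing(values):
    """Fill None gaps linearly between the nearest valid neighbours."""
    out = list(values)
    valid = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(valid, valid[1:]):
        for i in range(a + 1, b):
            frac = (i - a) / (b - a)
            out[i] = out[a] + frac * (out[b] - out[a])
    return out  # leading or trailing gaps remain None

def keep_trading_window(rows):
    """Keep weekday rows with a timestamp between 09:00 and 17:00."""
    kept = []
    for ts, price in rows:
        minutes = ts.hour * 60 + ts.minute
        if ts.weekday() < 5 and 9 * 60 <= minutes <= 17 * 60:
            kept.append((ts, price))
    return kept

filled = interpolate_missing([1.0, None, None, 4.0])

rows = [
    (datetime(2024, 1, 5, 9, 0), 18.50),   # Friday, inside the window
    (datetime(2024, 1, 5, 18, 0), 18.55),  # Friday, after 17:00
    (datetime(2024, 1, 6, 10, 0), 18.60),  # Saturday
]
kept = keep_trading_window(rows)
```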

3.3.2. Stationarity

The ADF [27] and KPSS [28] tests were applied to each data set at both the one-day level and the full-series level. Table 5 summarizes the outcomes. The results indicate weak stationarity across all series, with evidence of a unit root. This outcome suggests the presence of an underlying stochastic process that drives price evolution in each FTS. Such behavior is consistent with the influence of market forces, economic conditions, and geopolitical factors on asset prices [58].

3.3.3. Entropy

Shannon’s (1948) entropy measure was applied to each FTS, together with conditional entropy and mutual information as defined in later information-theoretic work [59]. Table 6 summarizes the results in units of information. The mutual information values indicate high levels of shared information across the open, close, high, and low price features. Conversely, the conditional entropy values indicate low levels of new information contributed by each feature. These outcomes are expected because all features represent different views of the same underlying price process. High mutual information reflects the shared structure of FTS, while low conditional entropy reflects the limited amount of unique information available from each feature.

3.3.4. Labeling

The labeling procedure defines a three-class classification structure for each FTS. The three classes correspond to buy, sell, and do-nothing actions, and each class reflects a directional decision for the next trading day. Framing the problem as a three-class classification task aligns with practical trading behavior, where the objective is not to forecast precise price levels, but to determine whether a position should be taken or avoided. This structure mirrors real-world execution choices and provides a stable alternative to point forecasting in noisy intra-day environments [57,60,61].

The construction of these classes relies on the distribution of daily price differences and on a set of financial indicators that provide directional signals. Figure 3 illustrates the distribution of daily differences between the open and close prices for the USDZAR exchange rate. Similar distributions were observed across all included data sets. Each distribution exhibits a near-normal shape with a high concentration of values around zero. This behavior aligns with the outcomes of the stationarity and entropy analyses. The weak stationarity observed in Section 3.3.2 implies that each series fluctuates around a slowly evolving mean, which contributes to the clustering of daily differences near zero. The entropy results further indicate high mutual information and low conditional entropy across features, which suggests that the open and close price series share substantial information and provide limited new information individually. These properties collectively explain the concentration of small daily movements and the approximate normality of the daily difference distributions.

The concentration of values near zero motivates the introduction of an offset region around the center of each distribution. This region defines a do-nothing class and excludes observations where the daily difference is too small to provide a reliable directional signal. The exclusion is justified by two considerations. First, the statistical properties of the series indicate that small positive and negative movements arise from similar underlying patterns, which reduces the ability of any model to distinguish between them. Second, common trading practice avoids directional decisions when price movements fall within a narrow range, since such movements do not justify a meaningful position. The offset therefore removes ambiguous cases and produces a clearer separation between buy and sell classes.

Directional labels outside the offset region are determined using three financial indicators, namely the average directional index (ADX) [62], the relative strength index (RSI) [63], and a moving average (MA) [64]. These indicators are widely used in practice and capture complementary aspects of market behavior, including trend strength, momentum, and mean reversion. Using three indicators provides a balance between signal diversity and interpretability, avoiding the instability that arises when relying on a single indicator or an overly complex indicator set. A buy or sell label is assigned when at least two indicators agree on the direction. This majority rule reduces the subjectivity associated with any single indicator and ensures that each label reflects a consistent and interpretable directional signal. A do-nothing label is assigned when the indicators do not agree or when the daily difference falls within the offset region. Figure 4 illustrates the ADX and RSI rules applied to determine regions for the buy, sell, and do-nothing directional labels.
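The offset and majority rule can be summarized in a few lines of code. In this sketch the three indicator signals are taken as given inputs encoded as +1 (buy), -1 (sell), or 0 (neutral); the actual ADX, RSI, and MA rules are those of Figure 4 and are not reproduced here. The function name and the inclusive treatment of the offset boundary are assumptions.

```python
def label_day(diff, offset, signals):
    """Assign buy / sell / do-nothing from a daily difference and three indicator signals.

    signals: three values in {+1, -1, 0}, hypothetically produced by the
    ADX, RSI, and MA rules (the rules themselves are not implemented here).
    """
    if abs(diff) <= offset:                      # inside the offset region: no reliable signal
        return "do-nothing"
    if sum(s == 1 for s in signals) >= 2:        # at least two indicators agree on buy
        return "buy"
    if sum(s == -1 for s in signals) >= 2:       # at least two indicators agree on sell
        return "sell"
    return "do-nothing"                          # indicators disagree
```

For example, `label_day(0.2, 0.05, (1, 1, -1))` returns `"buy"`, whereas the same signals with a daily difference of 0.01 return `"do-nothing"`.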

Table 7 summarizes the offset values and resulting class distributions for each data set. The distributions exhibit varying degrees of class imbalance, which reflect the natural behavior of real-world financial markets rather than any artifact of the labeling procedure. No resampling techniques were applied, as oversampling or weighting would introduce synthetic patterns and distort the empirical structure of directional movements. Offset sensitivity analysis confirmed that class proportions remained stable across reasonable offset choices, supporting the robustness of the labeling strategy. The resulting labels provide a consistent and interpretable structure for the classification task presented in this study.

3.4. Evaluation Metrics

Two evaluation metrics are used to assess the performance of the ensemble models: a classification accuracy metric and a profit metric. They provide complementary perspectives on model performance. The accuracy metric evaluates the correctness of the predicted class labels, while the profit metric evaluates the financial value of the predicted trading actions. To determine whether the ensemble model provides a statistically significant improvement over a baseline model, the accuracy and profit outcomes obtained across k cross-validation folds are compared using a set of paired statistical tests. These tests evaluate both the magnitude and the direction of the performance differences between the two models.

3.4.1. Classification Accuracy

The classification accuracy measures the proportion of correctly predicted labels relative to the total number of predictions. Let $y_i$ denote the true class label for day $i$, and let $\hat{y}_i$ denote the predicted class label. The accuracy is defined as

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} I(y_i = \hat{y}_i),$$

where $N$ is the total number of one-day samples and $I(\cdot)$ is the indicator function. This metric provides a direct measure of the model’s ability to identify the correct trading action.

3.4.2. Profit Metric

The profit metric evaluates the realized financial outcome of the predicted trading actions. For each day $i$, let $p_i$ denote the realized price difference between the open and close prices of the next trading day. A positive value of $p_i$ indicates a profitable buy action, while a negative value indicates a profitable sell action. Let $a_i$ denote the predicted action, where $a_i \in \{\text{buy}, \text{sell}, \text{do-nothing}\}$. The profit for day $i$ is defined as

$$\pi_i = \begin{cases} p_i, & \text{if } a_i = \text{buy}, \\ -p_i, & \text{if } a_i = \text{sell}, \\ 0, & \text{if } a_i = \text{do-nothing}. \end{cases}$$

The total profit across all days is then

$$\text{Profit} = \sum_{i=1}^{N} \pi_i.$$

This metric captures the cumulative financial value of the model’s predictions and provides an application-oriented evaluation aligned with trading practice.
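Both metrics translate directly into code. The following sketch assumes that labels and actions are stored as the strings "buy", "sell", and "do-nothing", and that each price difference is positive when a buy would have been profitable; the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of days on which the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def total_profit(actions, price_diffs):
    """Sum of per-day profits: +p for buy, -p for sell, 0 for do-nothing."""
    sign = {"buy": 1.0, "sell": -1.0, "do-nothing": 0.0}
    return sum(sign[a] * p for a, p in zip(actions, price_diffs))
```

A correct sell on a day with a price difference of -0.2 contributes +0.2 to the total, while an incorrect buy on a day with -0.1 contributes -0.1, so a model can be accurate on many small days and still lose on a few large ones.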

3.4.3. Statistical Comparison of Model Performance

Several statistical tests are applied to evaluate whether the ensemble model significantly outperforms the baseline model across paired observations obtained from a k-fold cross-validation procedure. Let $d_i$ denote the paired difference between the ensemble and baseline performance metrics for fold $i$, where $i = 1, \dots, k$.

Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test [65] evaluates whether the median of the paired differences obtained across the k cross-validation folds differs from zero. The test ranks the absolute paired differences $|d_i|$ and assigns signs based on the direction of each difference. The test statistic is the sum of the signed ranks. This non-parametric test is appropriate for model comparison because it does not assume normality of the fold-level paired differences.

Sign Test

The sign test [66] evaluates whether the number of positive paired differences obtained across the k cross-validation folds exceeds the number expected under the null hypothesis of no performance difference. The test statistic is the count of positive signs in the paired differences. This test provides a distribution-free measure of directional dominance and does not require any assumptions regarding the distribution of the fold-level performance differences.
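With only k paired differences, both tests can be evaluated exactly by enumeration. The sketch below is a self-contained illustration: the sign test is one-sided, and the Wilcoxon p-value is two-sided and computed from the positive-rank sum, which is equivalent to using the sum of signed ranks. It assumes no zero differences and no ties among the absolute differences for the Wilcoxon case; function names are hypothetical.

```python
import math
from itertools import product

def sign_test_p(diffs):
    """One-sided exact sign test: P(at least the observed number of positives | p = 0.5)."""
    nonzero = [d for d in diffs if d != 0]
    n, pos = len(nonzero), sum(d > 0 for d in nonzero)
    return sum(math.comb(n, j) for j in range(pos, n + 1)) / 2 ** n

def wilcoxon_exact_p(diffs):
    """Two-sided exact Wilcoxon signed-rank p-value (assumes no zeros or tied |d_i|)."""
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = {i: r for r, i in enumerate(order, start=1)}
    total = n * (n + 1) // 2
    w_plus = sum(ranks[i] for i, d in enumerate(diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign assignment of ranks 1..n is equally likely.
    count = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            count += 1
    return count / 2 ** n
```

When all ten folds favor the same model, `wilcoxon_exact_p` returns 2/1024, approximately 0.0020, which is the smallest two-sided p-value attainable with ten folds.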

Cliff’s Delta

Cliff’s delta ($\delta$) [67] measures the effect size by quantifying the degree to which the ensemble model outperforms the baseline model across the k cross-validation folds. It is defined as

$$\delta = \frac{\#(d_i > 0) - \#(d_i < 0)}{k},$$

where $d_i$ denotes the paired performance difference for fold $i$. Values close to 1 indicate strong dominance of the ensemble model, while values near 0 indicate negligible differences between the models.
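The paired-difference form of the effect size given above reduces to a count of signs. A minimal sketch, with a hypothetical function name:

```python
def cliffs_delta_paired(diffs):
    """(number of positive differences - number of negative differences) / k."""
    k = len(diffs)
    return (sum(d > 0 for d in diffs) - sum(d < 0 for d in diffs)) / k
```

If the ensemble wins every fold the function returns 1.0; for the differences `[1, -1, 0, 2]` it returns 0.25.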

Bootstrap Confidence Intervals

A non-parametric bootstrap procedure [37] is applied to estimate the confidence interval of the mean paired difference across the k cross-validation folds. The bootstrap resamples the k paired differences with replacement and computes the empirical distribution of the mean paired difference. The lower and upper bounds of the interval (CI Low, CI High) quantify the uncertainty associated with the estimated performance difference between the ensemble and baseline models.
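One common way to obtain such an interval is the percentile method, sketched below. The 95% level, the number of resamples, and the fixed seed are assumptions of this example and are not taken from the study.

```python
import random
from statistics import mean

def bootstrap_ci(diffs, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (CI Low, CI High) for the mean paired difference."""
    rng = random.Random(seed)
    k = len(diffs)
    # Resample the k fold-level differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=k)) for _ in range(n_boot))
    low = means[int((alpha / 2) * n_boot)]
    high = means[int((1 - alpha / 2) * n_boot) - 1]
    return low, high
```

An interval that lies entirely above zero indicates that the mean improvement of the ensemble over the baseline is unlikely to be an artifact of fold-to-fold variation.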

Bayesian Posterior Probabilities

A Bayesian comparison [68] evaluates the posterior probability that the ensemble model outperforms the baseline model across the k cross-validation folds. Let $\theta$ denote the probability that the paired difference $d_i$ is positive. A Beta prior is placed on $\theta$, and the posterior distribution is obtained by updating the prior with the observed counts of positive and negative paired differences. The posterior probabilities quantify the likelihood that a specific model was superior, given the observed fold-level performance differences.
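The Beta update described above has a closed form when the prior parameters are integers. The sketch below assumes a uniform Beta(1, 1) prior, which is an assumption of this example because the prior hyperparameters are not stated, and reports the posterior probability that $\theta$ exceeds 0.5.

```python
import math

def prob_ensemble_better(n_pos, n_neg, a0=1, b0=1):
    """P(theta > 0.5 | data) under a Beta(a0 + n_pos, b0 + n_neg) posterior.

    Integer parameters only; the default Beta(1, 1) prior is an assumption.
    """
    a, b = a0 + n_pos, b0 + n_neg
    n = a + b - 1
    # For integer a and b: P(theta <= x) = sum_{j=a}^{n} C(n, j) x^j (1 - x)^(n - j), with x = 0.5.
    cdf_half = sum(math.comb(n, j) for j in range(a, n + 1)) / 2 ** n
    return 1.0 - cdf_half
```

With ten positive and no negative differences the posterior is Beta(11, 1), and the function returns 1 - 0.5^11, roughly 0.9995; with equal counts it returns 0.5.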

4. Model Development

The ensemble model design for this study is structured around four core components comprising the base models, the bootstrap methods used to generate resampled training sets, the voting mechanisms that aggregate member predictions, and the ensemble sizes evaluated. Each component is examined systematically to assess its contribution to overall ensemble performance.

4.1. Base Models

Five predictive models are developed for this study. Three of these models serve as ensemble base learners, namely a decision tree (DT) [69], a logistic regression (LR) [70], and a multi-layer perceptron (MLP) [71]. Two additional standalone baseline models are included for comparative evaluation, namely an autoregressive integrated moving average (ARIMA) model [2] and a long short-term memory (LSTM) network [72]. The combined set of models provides a diverse collection of linear, non-linear, parametric, and non-parametric approaches. This ensures that the ensemble and baseline comparisons span a broad methodological spectrum.

The DT, LR, and MLP models are selected as ensemble base learners due to their simplicity, computational efficiency, and complementary modeling characteristics. Each model forms the basis of an independent bagging ensemble, allowing the study to evaluate how bootstrap-based resampling interacts with different classes of predictive models. The DT provides a non-parametric, rule-based classifier capable of capturing local decision boundaries. The LR provides a linear probabilistic classifier that offers interpretability and robustness in high-dimensional settings. The MLP provides a non-linear neural network classifier capable of modeling complex feature interactions. This variety enables a systematic assessment of how model class influences the behavior of bootstrap ensembles in FTS classification [73]. Hyperparameter optimization is intentionally omitted for the base learners, as the objective of this study is to isolate and evaluate the effects of bootstrap resampling and voting mechanisms rather than to optimize individual model performance.

The key control parameters for all five models are summarized in Table 8. The three ensemble base learners use the default parameters of the scikit-learn library [74], except for the MLP, where the hidden layer size is set to 160 units (33.3% of the input dimensionality). The reduced hidden layer size decreases the number of trainable weights, thereby reducing computational cost while maintaining sufficient representational capacity [75].

The ARIMA model provides a classical linear time series baseline capable of modeling short-term autocorrelation structures. The model is fitted to the daily closing prices extracted from each FTS. The daily series is differenced once to remove non-stationarity. The Akaike Information Criterion (AIC) is used to guide model selection, where AIC provides a likelihood-based measure of model quality that penalizes excessive model complexity, thereby balancing goodness of fit with model simplicity [79]. An AIC-based grid search is performed over $p, q \in \{0, 1, 2\}$ to select the ARIMA order $(p, 1, q)$. Multi-step forecasts are generated for the test horizon, and the forecasted price changes are converted into directional classes using a volatility-scaled offset. The offset is defined as $\sigma(\delta)$, where $\sigma(\delta)$ denotes the standard deviation of the forecasted price changes. This offset introduces a neutral zone around zero, ensuring that small or uncertain forecasted movements are classified as no-action decisions, which aligns with the labeling strategy. The ARIMA model therefore provides a benchmark for linear autoregressive behavior in financial time series [80]. The performance of an ARIMA model depends directly on its order parameters. Therefore, control parameter tuning is required to obtain a well-specified linear benchmark, whereas the other baseline models operate in standard configurations that do not require structural hyperparameter selection.
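The conversion of forecasted price changes into directional classes can be illustrated as follows. This sketch covers only the thresholding step, not the ARIMA fit or the AIC grid search; the use of the population standard deviation, the strict inequalities, and the function name are assumptions.

```python
from statistics import pstdev

def changes_to_classes(forecast_changes):
    """Map forecasted price changes to classes using an offset of one standard deviation."""
    offset = pstdev(forecast_changes)            # sigma of the forecasted changes
    labels = []
    for delta in forecast_changes:
        if delta > offset:
            labels.append("buy")                 # forecasted rise beyond the neutral zone
        elif delta < -offset:
            labels.append("sell")                # forecasted fall beyond the neutral zone
        else:
            labels.append("do-nothing")          # small or uncertain movement
    return labels
```

Forecasts that stay within one standard deviation of zero are therefore mapped to the do-nothing class, mirroring the offset region used when constructing the labels.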

The LSTM model provides a non-linear deep learning baseline capable of capturing short-term temporal dependencies. The network consists of a single LSTM layer with 32 hidden units followed by a dense output layer with three units and a softmax activation function. The hidden size of 32 units provides a lightweight architecture with sufficient capacity to model intra-day temporal structure while limiting the number of trainable parameters and reducing the risk of overfitting. The model is trained using the Adam optimizer with a learning rate of 0.001 and the sparse categorical cross-entropy loss function. The input to the model is a fixed-length sequence of intra-day observations, and the output corresponds to the three directional classes used throughout the study. This model provides a benchmark for recurrent neural architectures commonly applied to FTS classification [55].

4.2. Bootstrap Methods

Six bootstrap methods for time series data are used in the development of the ensemble models. These comprise the block, moving-block, sieve-block, Tukey, and stationary bootstrap procedures discussed in Section 2.6, which are implemented using the tsbootstrap package in Python 3.10.12, together with a sub-sampling method that is developed and evaluated alongside the established techniques. The block length used for generating bootstrap samples is set to 480, corresponding to the number of intra-day observations contained within a single day segment of each FTS.
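To make the role of the block length concrete, the sketch below implements a plain non-overlapping block bootstrap in which each block is one 480-observation day. It is a generic illustration written for this purpose and does not reproduce the tsbootstrap package or its interface; the function name and seed handling are hypothetical.

```python
import random

def block_bootstrap(series, block_length=480, seed=None):
    """Resample whole non-overlapping blocks with replacement, keeping order within a block."""
    rng = random.Random(seed)
    n_blocks = len(series) // block_length
    blocks = [series[i * block_length:(i + 1) * block_length] for i in range(n_blocks)]
    sample = []
    for _ in range(n_blocks):
        sample.extend(rng.choice(blocks))        # draw one full day at a time
    return sample
```

Because whole days are drawn together, the intra-day ordering of observations is preserved inside each block, while the sequence of days differs between bootstrap samples.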

4.3. Voting Mechanisms

Two voting schemes are compared, namely an equal-weighted majority voting scheme, in which each ensemble member contributes an identical vote, and an optimized weighted voting scheme, in which ensemble members are assigned weights that determine their relative influence [9]. The optimized scheme is implemented using two metaheuristic algorithms, namely PSO and QPSO, which gives three voting mechanisms in total. The selection of PSO is informed by the comprehensive review of Rezk and Selim (2024) [18], while QPSO is included due to its ability to capture the dynamic and non-stationary characteristics inherent in FTS data.

The control parameters for both PSO and QPSO are adapted using a self-adaptive strategy based on the stability-guided approaches proposed by Harrison et al. (2015, 2017) [51,53], as discussed in Section 2.7. The remaining algorithm-specific parameters are listed in Table 9.

In addition, two loss functions are used independently as objective functions within the PSO and QPSO optimization processes to evaluate ensemble performance under different criteria. The entropy-based loss function aligns with directional accuracy by penalizing incorrect predictions uniformly, whereas the profit-based loss function evaluates the realized profit or loss associated with each prediction, thereby capturing model profitability. The motivation for employing two distinct loss functions follows from the distribution of daily price differences in Figure 3: because most movements are small and only a few are large, accuracy and profitability are not linearly related.
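The two objectives can be written as functions of the member weights, which is the form an optimizer such as PSO or QPSO would minimize. The sketch below is one possible formulation under stated assumptions: members output class probabilities, the entropy-based loss is taken to be the mean cross-entropy of the weighted vote, and the profit-based loss is the negative total profit. The class ordering and all function names are hypothetical.

```python
import numpy as np

# Assumed class order: index 0 = sell, 1 = do-nothing, 2 = buy.
DIRECTION = np.array([-1.0, 0.0, 1.0])           # profit sign applied to the price difference

def weighted_vote(member_probs, weights):
    """Combine (M, N, 3) member probabilities with (M,) weights into (N, 3) ensemble probabilities."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize so the weights sum to one
    return np.tensordot(w, member_probs, axes=1)

def entropy_loss(weights, member_probs, y_true):
    """Mean cross-entropy of the weighted vote against integer class labels."""
    p = weighted_vote(member_probs, weights)
    return -np.mean(np.log(p[np.arange(len(y_true)), y_true] + 1e-12))

def profit_loss(weights, member_probs, price_diffs):
    """Negative total profit of the actions chosen by the weighted vote."""
    actions = weighted_vote(member_probs, weights).argmax(axis=1)
    return -np.sum(DIRECTION[actions] * np.asarray(price_diffs, dtype=float))
```

Setting all weights equal recovers an unweighted soft vote, so the same functions can score the equal-weight baseline alongside any optimized weight vector.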

4.4. Ensemble Sizes

Five variations of each ensemble model are developed, consisting of ensembles with 10, 30, 50, 100, and 150 members, respectively. These sizes allow an examination of how ensemble performance evolves as the number of constituent members increases, ranging from small ensembles to more computationally intensive configurations.

4.5. Training Procedure

Three ensemble models are developed using DT, LR, and MLP base learners combined with six bootstrap methods. Each base model–bootstrap combination is evaluated across five ensemble sizes of 10, 30, 50, 100, and 150 members. Model training and evaluation are performed using the time series cross-validation procedure available in the scikit-learn package [74], with ten cross-validation folds applied for each voting mechanism. The choice of ten folds provides a balance between computational efficiency and robust performance estimation across the temporal structure of each FTS.
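The fold structure can be illustrated with scikit-learn's `TimeSeriesSplit`, which produces expanding training windows followed by later test windows. The number of one-day samples below is a hypothetical placeholder, and the sketch shows only the splitting step rather than the full training loop.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_days = 110                                     # hypothetical number of one-day samples
X = np.arange(n_days).reshape(-1, 1)             # stand-in feature matrix, one row per day

splitter = TimeSeriesSplit(n_splits=10)          # ten chronological folds
folds = list(splitter.split(X))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # Every training index precedes every test index, so no future data leaks into training.
    print(fold, len(train_idx), len(test_idx))
```

Each ensemble configuration and each baseline would be fitted on `train_idx` and scored on `test_idx`, giving ten paired accuracy and profit values per comparison.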

Three voting mechanisms are examined, namely equal-weighted majority voting, and PSO- and QPSO-based optimized voting. After each permutation run, the accuracy and profit metrics are recorded for subsequent analysis. Results are summarized using a wins-based approach, where wins correspond to the highest mean performance across datasets for a given metric. Standard deviations are used to confirm the robustness of each winner, with lower variability indicating more stable performance. Wins therefore identify the best average accuracy and best average profitability achieved across the experimental configurations.

The baseline models are trained using the same time series cross-validation procedure applied to the ensembles, ensuring that all models are evaluated under an identical temporal structure. Each baseline model is fitted on the training portion of each fold and assessed on the corresponding test segment, preserving chronological order throughout. Standard configurations are used for all baseline methods, with ARIMA being the only model requiring structural tuning due to its dependence on order parameters. This alignment with the ensemble training process provides a consistent and comparable evaluation framework across all models.

5. Ensemble Results

This section presents and discusses the empirical results of the ensemble framework. The overall performance of the ensemble models is evaluated across base learners, bootstrap methods, voting mechanisms, ensemble sizes, and objective functions. Detailed analyses are then provided to examine how each design component influences predictive accuracy and profitability. This offers a comprehensive view of the factors that shape ensemble performance for the FTS included in this study.

5.1. Overall Performance

The LR ensemble model provides the strongest overall performance across all data sets, bootstrap methods, voting mechanisms, and ensemble sizes, as shown in Table 10 and Table 11. The LR ensemble consistently outperforms the DT and MLP ensembles on both the accuracy and profit metrics, with the exception of the USDZAR data set, where the MLP ensemble achieves a marginally higher profitability. These results indicate that the LR ensemble is able to capture the dominant directional structure of an FTS more reliably than the DT and MLP ensembles.

The superior performance of the LR ensemble can be explained by the bias–variance characteristics of the underlying learners. LR is a high-bias, low-variance model [73], and when combined with bootstrap aggregation, its stable decision boundary becomes more robust to the noise and microstructure irregularities present in intra-day financial data [81]. In contrast, the MLP is a low-bias, high-variance model whose performance is sensitive to hyperparameter tuning [75]. Without extensive tuning, the MLP tends to overfit short-lived fluctuations that do not generalize across bootstrap samples or cross-validation folds. The DT ensemble exhibits similar behavior. Although bagging reduces variance, the underlying tree structure remains sensitive to small perturbations in the data [69], which limits its ability to generalize in weakly stationary environments.

These observations align with the statistical properties of the data sets analyzed in Section 3.3.2 and Section 3.3.3. The weak stationarity and high mutual information across features imply that the directional signal is relatively smooth and dominated by broad, persistent tendencies rather than complex nonlinear interactions [82]. In such settings, linear decision boundaries often outperform more flexible nonlinear models, particularly when the latter are not extensively tuned [55]. The LR ensemble therefore benefits from a favorable alignment between model structure and the underlying data-generating process.

The Tukey bootstrap method provides the strongest overall performance across the data sets, as shown in Table 12. The tapering applied at block boundaries produces smoother transitions between adjacent segments, which better preserves the local autocorrelation and volatility clustering inherent in FTS [41]. Methods with hard block boundaries introduce artificial discontinuities that distort short-term temporal structure. The stationary bootstrap performs well on some data sets, reflecting the suitability of its block-based resampling for certain local patterns, but the Tukey method provides the most consistent performance across both accuracy and profitability.

Table 12 also summarizes the performance of the voting mechanisms. The optimized weighted voting mechanisms (PSO and QPSO) outperform equal-weighted majority voting on the profitability metric across all data sets. This behavior reflects the ability of PSO and QPSO to identify weight configurations that emphasize ensemble members capturing rare but profitable directional movements [18,83]. The QPSO algorithm benefits from a more global search capability, which reduces the likelihood of converging to local optima in the profit landscape [83].

The preferred ensemble size across the FTS data sets is 50, as shown in Table 13. This result aligns with ensemble theory, which shows that increasing the number of members reduces variance up to a saturation point, after which additional members contribute diminishing returns due to increasing correlation among bootstrap samples [73]. In FTS, where directional signals are weak and noisy, excessively large ensembles may oversmooth the signal and reduce sensitivity to rare but profitable deviations. The diversity observed across data sets reflects underlying differences in market microstructure and volatility regimes, where each instrument exhibits distinct patterns of liquidity, noise, and volatility clustering that influence ensemble behavior [81].

Table 13 also highlights the contrasting behavior of the entropy and profit loss functions. The entropy loss function consistently yields higher accuracy, as it rewards correct predictions uniformly and therefore encourages the optimization algorithms to maximize classification performance [84]. In contrast, the profit loss function prioritizes trades with higher financial impact, even if this results in lower overall accuracy. This divergence reflects an established property of financial prediction tasks: accuracy and profitability are not linearly related [85]. A model may achieve modest accuracy while still capturing a small number of highly profitable directional movements. The overall findings reinforce the importance of optimizing ensemble voting weights, as this enables an ensemble to target profitability more effectively than accuracy-driven configurations.

5.2. Performance of Bootstrap Methods

Table 14 summarizes the performance of the six bootstrap methods across the three ensemble model types. The totals reflect the number of data sets (six in total) for which a bootstrap method achieved the best performance under a given metric. The results show a clear preference for the Tukey method in both the DT and MLP ensembles, where it dominates across accuracy and profit. This behavior is consistent with the overall findings in Section 5.1, where the Tukey method frequently produced the most stable and profitable ensembles.

The LR ensembles exhibit a more heterogeneous pattern. For accuracy, the sub-sample method is preferred, while the profit metric shows no single dominant bootstrap method. This divergence reflects the sensitivity of LR ensembles to the structure of the resampled data, where different bootstrap methods preserve different aspects of the underlying temporal dependencies. Overall, the results indicate that the Tukey method is the most robust choice for DT and MLP ensembles, but not necessarily for LR ensembles.

Table 15 examines bootstrap performance from the perspective of the voting mechanism. Each metric contains 36 observations, corresponding to six data sets evaluated under six bootstrap methods. Unlike the model-specific results, no bootstrap method consistently dominates across the voting mechanisms. This suggests that the choice of bootstrap method does not materially influence the behavior of the voting mechanism itself. Instead, the voting mechanism appears to respond primarily to the ensemble’s predictive structure rather than the resampling scheme used to generate its members.

Table 16 evaluates bootstrap performance across ensemble sizes. The Tukey method again performs strongly, achieving the highest number of wins across nearly all ensemble sizes for both accuracy and profit. The stationary bootstrap is the only meaningful competitor, particularly at larger ensemble sizes, which highlights its ability to preserve short-range dependencies. These results indicate that the Tukey and stationary methods are the most effective at generating diverse yet structurally coherent bootstrap samples, which in turn support more stable ensemble performance.

5.3. Performance of Voting Mechanisms

Table 15 summarizes the performance of the voting mechanisms across bootstrap methods. The accuracy results show a slight preference for the equal-weighted majority voting mechanism, although the combined performance of the PSO and QPSO mechanisms indicates that optimized voting weights frequently outperform majority voting. This pattern suggests that while majority voting provides a stable baseline, optimized weighting can yield additional gains when the underlying models benefit from differential contribution strengths.

Table 17 further illustrates this behavior across model types. QPSO is the most effective voting mechanism overall, achieving the highest number of wins for both accuracy and profitability. The DT ensembles are an exception, where majority voting performs best under the accuracy metric. This indicates that DT ensembles benefit from uniform weighting, likely due to their high variance and the stabilizing effect of equal contributions. In contrast, LR and MLP ensembles benefit more from optimized weighting, where QPSO consistently identifies more effective voting configurations.

Table 18 examines performance across ensemble sizes. The PSO mechanism is preferred for most ensemble sizes under both accuracy and profit, indicating that optimized weighting becomes increasingly beneficial as the ensemble grows. Majority voting remains competitive for accuracy, particularly at smaller ensemble sizes, but does not match the profitability achieved by PSO or QPSO. These results suggest that optimization plays a more important role when ensembles become larger and more diverse, where uniform weighting may fail to capture the relative strengths of individual members.

Table 19 evaluates the loss functions used within the PSO and QPSO mechanisms. As expected, the entropy loss function leads to higher accuracy, since the optimization process is designed to maximize the number of correct predictions. Conversely, the profit loss function yields higher profitability, as the search process explicitly targets profitable directional movements. The consistency of these results across both PSO and QPSO indicates that the choice of loss function is the primary determinant of whether the ensemble prioritizes accuracy or profitability.

The broader results in Table 20, Table 21 and Table 22 reinforce this pattern. The entropy loss function aligns strongly with the accuracy metric, while the profit loss function aligns with profitability. Instances where the profit loss function also improves accuracy indicate that the optimization process has identified configurations that simultaneously enhance predictive correctness and financial performance. Such outcomes are particularly desirable because the choice of objective function otherwise determines whether the optimization process prioritizes predictive accuracy or financial performance.

5.4. Performance Impact of Ensemble Sizes

As discussed in Section 5.1, ensembles with 50 members provide the strongest overall performance across the majority of data sets. Table 23 offers additional insight by comparing model types across ensemble sizes. The LR ensembles achieve the highest accuracy at all ensemble sizes, while the MLP ensembles rival LR performance only in terms of profitability. The LR ensembles also show marginally stronger profitability at larger ensemble sizes, suggesting that linear decision boundaries benefit from the increased stability associated with larger ensembles [86].

Table 13 further shows that although 50-member ensembles perform well on average, the preferred ensemble size differs between the accuracy and profitability metrics. Ensemble sizes below 50 do not appear among the accuracy winners, yet they do appear among the profitability winners. This divergence reflects the fact that ensemble size influences different performance metrics in distinct ways. Larger ensembles tend to reduce variance and improve accuracy, while smaller ensembles may preserve directional characteristics that contribute to higher profitability [87].

The results in Table 16 and Table 18 show no strong dependency between ensemble size and the choice of bootstrap method or voting mechanism. However, these tables reinforce the broader patterns observed earlier. The Tukey bootstrap method remains the most effective across ensemble sizes, and PSO-based weighted voting mechanisms consistently outperform majority voting in profitability. These findings align with the general principle that ensemble performance depends not only on the number of members but also on the diversity and weighting of those members [88].

6. Statistical Evaluation of Ensembles vs. Baseline Models

The statistical evaluation in this section quantifies the performance differences between the developed ensemble models and the baseline models introduced in Section 4.1 across all six FTS data sets. The aim is to determine whether the observed improvements in accuracy and profit reflect consistent differences rather than random variation. To support this, a set of non-parametric statistical tests is applied to the paired performance results, and these tests report both the direction and the magnitude of the differences. These tests form the basis for the comparative results presented in this section.

6.1. Formulation of Statistical Tests

The statistical evaluation in this study assesses whether the ensemble models offer consistent improvements over the baseline models introduced in Section 4.1. The baselines include ARIMA, DT, LR, MLP, and LSTM models, which cover a range of approaches commonly applied to FTS classification. The ensemble configurations identified in Section 5 and listed in Table 24 are evaluated against these baselines to determine whether the observed differences in accuracy and profit are statistically significant.

A suite of non-parametric statistical tests defined in Section 3.4.3 is applied to the paired performance differences between each ensemble configuration and its corresponding baseline model across all six data sets. The tests include the Wilcoxon signed-rank test, the sign test, Cliff’s delta, and bootstrap confidence intervals. Each reports information on the direction and magnitude of the differences. The Bayesian signed-rank test is also used to estimate the posterior probability that an ensemble model outperforms its baseline counterpart. This combination of tests provides a structured basis for the comparative analysis.
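The frequentist part of this suite can be sketched in a few lines of Python. The sketch below is illustrative only: the function names, the percentile form of the bootstrap interval, the two-sided alternatives, and the ten example values are assumptions made here, and the exact definitions used in the study are those of Section 3.4.3, which may differ. It assumes SciPy 1.7 or later for `scipy.stats.binomtest`.

```python
import numpy as np
from scipy import stats


def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-sample pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = x[:, None] - y[None, :]
    return float((np.sum(diff > 0) - np.sum(diff < 0)) / diff.size)


def bootstrap_ci(d, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    d = np.asarray(d, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the paired differences with replacement, n_boot times.
    idx = rng.integers(0, d.size, size=(n_boot, d.size))
    means = d[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def compare_paired(ensemble, baseline):
    """Summarize paired ensemble-minus-baseline differences."""
    ensemble = np.asarray(ensemble, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    d = ensemble - baseline
    wins, losses = int(np.sum(d > 0)), int(np.sum(d < 0))
    if wins + losses == 0:
        # All pairs tied: no evidence in either direction.
        wilcoxon_p = sign_p = 1.0
    else:
        # Two-sided Wilcoxon signed-rank test on the paired values.
        wilcoxon_p = float(stats.wilcoxon(ensemble, baseline).pvalue)
        # Two-sided sign test: ties dropped, wins ~ Binomial(n, 0.5).
        sign_p = float(stats.binomtest(wins, wins + losses, 0.5).pvalue)
    ci_low, ci_high = bootstrap_ci(d)
    return {
        "wilcoxon_p": wilcoxon_p,
        "sign_p": sign_p,
        "cliffs_delta": cliffs_delta(ensemble, baseline),
        "ci_low": ci_low,
        "ci_high": ci_high,
    }


# Hypothetical accuracies from ten paired runs (made-up values, not from the paper).
ensemble = [0.61, 0.63, 0.60, 0.64, 0.62, 0.65, 0.66, 0.59, 0.67, 0.68]
baseline = [0.50, 0.51, 0.47, 0.50, 0.47, 0.49, 0.49, 0.41, 0.48, 0.48]
result = compare_paired(ensemble, baseline)
```

In this made-up example every ensemble value exceeds every baseline value, so Cliff's delta equals 1.0 and the bootstrap interval for the mean difference lies strictly above zero. The p-values report whether a difference is detectable, while the effect size and interval report its magnitude.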

6.2. Findings for the Accuracy Metric

This section interprets the statistical results for the accuracy metric and explains the factors that contribute to the observed performance differences. The ensemble models incorporate several sources of diversity through bootstrap resampling, varying ensemble sizes, and optimized voting mechanisms. These elements allow the ensembles to capture a broader range of patterns within FTS data than the individual baseline models. The optimized voting mechanisms also allow an ensemble to weight members according to their contribution to predictive performance, which reduces the influence of weaker members and improves overall accuracy. The statistical evaluation therefore indicates whether these design choices lead to consistent performance differences rather than outcomes driven by random variation. A summary of the statistical results for the accuracy metric is provided in Table 25, and the detailed outputs of the statistical tests are presented in Table A1 in Appendix A, Appendix A.1.
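The difference between equal-weight majority voting and weighted voting can be made concrete with a minimal sketch. In the study the weights are found by PSO or QPSO; here they are fixed, hypothetical values chosen by hand, and the function names, the binary label encoding in {0, 1}, and the rule that ties go to class 1 are assumptions made for illustration.

```python
import numpy as np


def majority_vote(preds):
    """Equal-weight vote; preds has shape (members, samples) with labels in {0, 1}."""
    preds = np.asarray(preds, dtype=float)
    return (preds.mean(axis=0) >= 0.5).astype(int)


def weighted_vote(preds, weights):
    """Weighted vote; weights are normalized to sum to 1 before voting."""
    preds = np.asarray(preds, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Weighted share of members voting for class 1, per sample.
    return (w @ preds >= 0.5).astype(int)


# Three hypothetical members predicting the direction of four samples.
preds = [
    [1, 0, 1, 1],  # member assumed to be the most reliable
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]
weights = [0.6, 0.2, 0.2]  # hand-picked stand-in for optimized weights

equal = majority_vote(preds)
weighted = weighted_vote(preds, weights)
```

With these values the two weaker members outvote the first member on the first and fourth samples under majority voting, whereas the weighted vote follows the first member on those samples. This is the mechanism by which weighting reduces the influence of weaker members.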

Across all six data sets, the accuracy-optimal ensemble configuration demonstrates strong and consistent improvements over the baseline models. This is particularly evident in the BCOUSD data set, where the ensemble achieves very low Wilcoxon signed-rank p-values ($p = 0.0020$) against both ARIMA and DT, accompanied by large effect sizes (Cliff’s $δ = 1.0$) and positive confidence intervals. The Bayesian posterior probabilities further support these findings, with the ensemble achieving values as high as 91.7% against multiple baselines. Similar patterns are observed for ETXEUR and USDZAR, where the ensemble again achieves strong evidence of superiority, reflected by Wilcoxon p-values below 0.0040 and effect sizes above 0.88 for several baselines.

In contrast, the few cases where a baseline model shows a higher posterior probability, primarily the LR baseline model and, in isolated instances, the LSTM model, are not supported by strong statistical evidence. For example, in the JPXJPY data set, the LSTM model attains a Bayesian posterior probability of 58.3%, but this result is accompanied by a Wilcoxon p-value of 0.9219, a negligible effect size (Cliff’s $δ = 0.03$), and a confidence interval that spans both positive and negative values. Similar patterns are observed in the JPYZAR and XAUUSD data sets, where the LR baseline model occasionally shows a slight advantage, but the confidence intervals cross zero and the effect sizes remain small.

These results indicate that the ensemble models provide reliable and stable improvements in accuracy across diverse financial markets. The statistical evidence consistently favors the ensemble models, while baseline advantages are limited, inconsistent, and not supported by strong statistical indicators.
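The recurring value $p = 0.0020$ can be read as the floor of an exact two-sided test. The worked arithmetic below assumes $n = 10$ paired observations per comparison with no ties or zero differences. This number of pairs is an assumption made here for illustration and is not stated in this section. Under that assumption, when every pair favors the same model, both the exact Wilcoxon signed-rank test and the sign test attain their smallest possible two-sided p-value.

```latex
% Assumption: n = 10 paired observations, no ties or zero differences.
p_{\min} = 2 \cdot \left(\tfrac{1}{2}\right)^{n}
         = \frac{2}{2^{10}}
         = \frac{2}{1024}
         \approx 0.00195
```

To four decimal places this is 0.0020. If the assumption holds, such a p-value indicates that one model won every paired comparison, and no smaller value is attainable however large the margin, which is why the effect size and confidence interval are needed to judge magnitude.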

6.3. Findings for the Profit Metric

This section interprets the statistical results for the profit metric and examines the factors that contribute to the observed performance differences. The ensemble models incorporate several design elements that influence profit-based performance, including bootstrap resampling to introduce diversity, the selection of ensemble sizes that balance variance reduction and model stability, and optimized voting mechanisms that weight ensemble members according to their contribution to profit. These choices enable the ensembles to capture profit-relevant patterns within FTS data that may not be fully exploited by the individual baseline models. A summary of the statistical results for the profit metric is provided in Table 26, and the detailed outputs of the statistical tests are presented in Table A2 in Appendix A, Appendix A.2.

Across all six FTS data sets, the ensemble models demonstrate consistent and statistically supported improvements in profit relative to the baseline models. This is particularly evident in the BCOUSD and JPXJPY data sets, where the ensembles achieve very low Wilcoxon signed-rank p-values ($p = 0.0020$), large effect sizes (Cliff’s $δ$ ranging from 0.58 to 1.0), and strictly positive confidence intervals. The Bayesian posterior probabilities further support these results, with the ensembles achieving values of 91.7% for BCOUSD and JPXJPY. Similar patterns appear in USDZAR, where the ensemble again shows strong evidence of superiority, reflected by Wilcoxon p-values of $p = 0.0020$, effect sizes above 0.90, and positive confidence intervals.

In contrast, the few cases where a baseline model shows a higher posterior probability are not supported by strong statistical evidence. For example, in the JPYZAR data set, the LR baseline model attains a Bayesian posterior probability of 66.7%, but this result is accompanied by a Wilcoxon p-value of 0.3750, a negative effect size (Cliff’s $δ = -0.14$), and a confidence interval that spans zero. Similar patterns appear in the ETXEUR and XAUUSD data sets, where the LR baseline model occasionally shows a slight advantage, but the confidence intervals cross zero and the effect sizes remain small.

These results show that the ensemble models achieve clear and consistent gains in profit across the FTS considered. The statistical indicators support these outcomes, while the few apparent baseline advantages lack consistent evidence and do not reflect systematic performance differences.

7. Conclusions and Future Work

This study addressed the need for a comprehensive, end-to-end evaluation of bagging ensemble models for financial time series (FTS) classification. It responded to gaps in the literature related to the interaction of bootstrap methods, ensemble sizes, voting mechanisms, and loss functions. The work examined the full modeling pipeline, from data preprocessing and the construction of a supervised classification problem to the design and evaluation of ensemble configurations across six diverse FTS data sets. The empirical analysis incorporated decision tree (DT), logistic regression (LR), and multi-layer perceptron (MLP) base learners, six time series bootstrap methods, five ensemble sizes, and three voting mechanisms, with additional analysis of the role of entropy- and profit-based loss functions within particle swarm (PSO) and quantum-inspired particle swarm (QPSO) optimization.

The results of this study show that LR-based ensembles provide the strongest overall performance across the six FTS data sets, outperforming the ARIMA, DT, LR, MLP, and LSTM baseline models on both accuracy and profit metrics. The statistical evaluation supports these outcomes, with the ensemble models achieving consistently positive confidence intervals, large effect sizes, and high Bayesian posterior probabilities across most comparisons. Apparent baseline advantages occur only in isolated cases and lack strong statistical support. The choice of the bootstrap method affects performance in model-specific ways. DT and MLP ensembles show their best results under the Tukey bootstrap, while LR ensembles achieve strong performance under the block bootstrap, the sub-sample bootstrap method, and the Tukey bootstrap method. The evaluation also shows that optimized voting mechanisms offer clear advantages over equal-weight majority voting, with the profit-based loss function producing the most consistent gains in this study. The analysis of ensemble size further indicates that FTS classification problems exhibit an optimal range of ensemble members, as larger ensembles do not always yield additional improvements and may reduce performance in certain cases. These findings collectively show that ensemble performance depends on the interaction of bootstrap diversity, ensemble size, and voting strategy, and that careful design choices are necessary to achieve reliable improvements in FTS classification.

Practical Implications. The findings of this study offer several practical insights for FTS practitioners. First, LR-based ensembles provide a reliable and interpretable foundation for directional classification tasks across diverse market conditions. Second, the selection of bootstrap methods should reflect the characteristics of the base learner. The Tukey bootstrap provides strong and consistent performance for DT and MLP ensembles, while LR ensembles achieve their best results under the block bootstrap, the sub-sample bootstrap, and the Tukey bootstrap. As a result, no single resampling strategy dominates across all learners. Third, optimized voting mechanisms offer clear advantages over equal-weight majority voting, and profit-oriented loss functions provide the most consistent improvements in this study. Finally, the identification of optimal ensemble sizes highlights the importance of balancing diversity with computational efficiency, especially in real-time or resource-constrained environments.

Limitations. Several limitations should be acknowledged. The analysis is restricted to six FTS data sets, which, although diverse, do not capture the full range of market regimes or structural characteristics. The study also focuses on three base learners and a specific family of optimization algorithms, leaving open the question of how alternative model classes or optimization strategies might behave under similar ensemble designs. In addition, the profit metric used in this study does not incorporate transaction costs or market frictions, which may influence real-world applicability. These limitations provide opportunities for further investigation.

Future Work. Future research may extend this study in several directions. One avenue is the incorporation of kernel-based methods to explore nonlinear extensions of LR ensembles and to assess their effect on model bias and predictive stability. Another direction involves variation in the number of iterations in the self-adaptive PSO and QPSO algorithms to better understand their convergence behavior under different loss functions. A multi-objective optimization approach that combines entropy and profit loss functions may also produce more balanced ensemble designs. Further work may examine alternative sampling distributions for the quantum cloud used in QPSO and expand the empirical evaluation to include additional FTS data sets with different structural characteristics. These directions also address several limitations of the current study, which include the use of a finite set of FTS data sets, a focus on three base learners, and the exclusion of transaction costs and market frictions from the profit metric. By broadening the empirical scope and exploring additional modeling components, future research would deepen the understanding of ensemble behavior in FTS classification and strengthen the practical relevance of the results.

Author Contributions

The individual contributions of the author A.N. are conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, and visualization. The individual contributions of the author A.P.E. are supervision and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study did not require any Institutional Review Board approvals.

Data Availability Statement

The original data presented in the study are openly available in HistData.com, at https://www.histdata.com/download-free-forex-data/?/excel/1-minute-bar-quotes (accessed on 24 June 2024).

Acknowledgments

During the preparation of this manuscript, the authors used Microsoft Copilot (April 2026 version) for the purposes of manuscript refinement, phrasing adjustment, and editorial clarity. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

GAF	Gramian angular fields
MTF	Markov transition fields
FTS	Financial time series
NFLT	No free lunch theorem
ADF	Augmented Dickey–Fuller test
KPSS	Kwiatkowski–Phillips–Schmidt–Shin test
CE	Conditional entropy
MI	Mutual information
ADX	Average directional index
RSI	Relative strength index
SMA	Simple moving average
EMA	Exponential moving average
MACD	Moving average convergence divergence
MBB	Moving block bootstrap
PSO	Particle swarm optimization algorithm
QPSO	Quantum-inspired particle swarm optimization algorithm
BCOUSD	Brent crude oil price in United States Dollars
ETXEUR	EURO STOXX 50 index in EUROs
JPXJPY	Nikkei stock average in Japanese Yen
JPYZAR	South African Rand and Japanese Yen exchange rate
USDZAR	South African Rand and United States Dollar exchange rate
XAUUSD	Gold price in United States Dollars
ML	Machine learning
DL	Deep learning
DT	Decision tree
LR	Logistic regression
MLP	Multi-layer perceptron
ARIMA	Autoregressive integrated moving average
AIC	Akaike information criterion
GARCH	Generalized autoregressive conditional heteroskedasticity
LSTM	Long short-term memory network
CI	Bootstrap confidence interval

Appendix A. Ensemble vs. Baseline Models Statistical Tests

The tables presented in this appendix contain the results obtained from the statistical tests in Section 3.4 for the accuracy and profit model pairings. The Wilcoxon signed-rank test (Wilcoxon p) and the sign test (Sign p) evaluate whether the ensemble model significantly outperforms the baseline across paired observations. Cliff’s delta ($δ$) measures the effect size, where values close to 1 indicate strong dominance of the ensemble and values near 0 indicate negligible differences. The bootstrap confidence interval (CI Low, CI High) reflects the uncertainty around the mean paired difference. The Bayesian columns (Bayes Ens, Bayes Base) report the posterior probability that the ensemble or baseline model is better, respectively. The larger probability is highlighted in bold to indicate the more likely winner.
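As an aid to reading the two Bayesian columns, the sketch below shows a deliberately simplified Beta-Binomial model of paired wins and losses. It is a stand-in chosen for illustration and is not the Bayesian signed-rank test defined in Section 3.4.3. The uniform Beta(1, 1) prior, the function names, and the treatment of ties (dropped) are all assumptions made here.

```python
from scipy import stats


def posterior_mean_win_rate(wins, losses, prior_a=1.0, prior_b=1.0):
    """Posterior mean of the ensemble win rate under a Beta(prior_a, prior_b) prior."""
    return (wins + prior_a) / (wins + losses + prior_a + prior_b)


def posterior_prob_better(wins, losses, prior_a=1.0, prior_b=1.0):
    """Posterior probability that the ensemble win rate exceeds 0.5."""
    # Posterior is Beta(wins + prior_a, losses + prior_b); sf gives P(theta > 0.5).
    return float(stats.beta.sf(0.5, wins + prior_a, losses + prior_b))


# Hypothetical outcome: the ensemble wins all ten paired comparisons.
mean_ens = posterior_mean_win_rate(10, 0)   # 11 / 12
mean_base = 1.0 - mean_ens                  # complementary share for the baseline
prob_ens = posterior_prob_better(10, 0)     # 1 - 0.5 ** 11
```

In this simplified model the two shares always sum to one, as the Bayes Ens and Bayes Base columns do, and a uniform prior keeps the ensemble share below 100% even when it wins every pair. The posterior mean and the probability of exceeding 0.5 are different summaries, so which one a table reports should be checked against the definition in Section 3.4.3.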

Appendix A.1. Accuracy Statistical Analysis

Table A1. Statistical comparison between the accuracy-optimal ensemble and baseline models across all datasets. Bayesian probabilities are expressed as percentages. Values truncated to four decimal places.

Dataset	Baseline	Wilcoxon p	Sign p	Cliff’s $δ$	CI Low	CI High	Bayes Ens	Bayes Base
BCOUSD	ARIMA	0.0020	0.0020	1.0000	0.2360	0.3254	91.7%	8.3%
BCOUSD	DT	0.0020	0.0020	1.0000	0.1019	0.1482	91.7%	8.3%
BCOUSD	LR	0.3621	0.7266	0.0300	−0.0023	0.0064	60.0%	40.0%
BCOUSD	MLP	0.0840	0.3438	0.5400	0.0270	0.1502	66.7%	33.3%
BCOUSD	LSTM	0.1230	0.0703	0.1600	−0.0003	0.0225	80.0%	20.0%
ETXEUR	ARIMA	0.0020	0.0020	1.0000	0.2026	0.3361	91.7%	8.3%
ETXEUR	DT	0.0039	0.0215	0.8800	0.0397	0.0814	83.3%	16.7%
ETXEUR	LR	0.7695	0.7539	−0.0500	−0.0180	0.0098	58.3%	41.7%
ETXEUR	MLP	0.0020	0.0020	1.0000	0.1897	0.2629	91.7%	8.3%
ETXEUR	LSTM	0.9219	1.0000	−0.0400	−0.0263	0.0237	50.0%	50.0%
JPXJPY	ARIMA	0.0020	0.0020	0.9600	0.0479	0.1193	91.7%	8.3%
JPXJPY	DT	0.0020	0.0020	0.9900	0.0576	0.0939	91.7%	8.3%
JPXJPY	LR	0.5735	0.7266	0.0300	−0.0023	0.0035	60.0%	40.0%
JPXJPY	MLP	0.0371	0.3438	0.5500	0.0109	0.0547	66.7%	33.3%
JPXJPY	LSTM	0.9219	0.7539	0.0300	−0.0270	0.0257	41.7%	58.3%
JPYZAR	ARIMA	0.0020	0.0020	1.0000	0.3148	0.3952	91.7%	8.3%
JPYZAR	DT	0.0020	0.0020	0.8300	0.0688	0.1305	91.7%	8.3%
JPYZAR	LR	0.9056	1.0000	−0.0600	−0.0074	0.0080	45.5%	54.5%
JPYZAR	MLP	0.0098	0.1094	0.8200	0.1077	0.2476	75.0%	25.0%
JPYZAR	LSTM	0.6250	0.7539	0.0400	−0.0238	0.0264	58.3%	41.7%
USDZAR	ARIMA	0.0020	0.0020	1.0000	0.1042	0.1421	91.7%	8.3%
USDZAR	DT	0.0020	0.0020	0.9800	0.0540	0.0900	91.7%	8.3%
USDZAR	LR	0.5566	1.0000	0.1200	−0.0045	0.0106	50.0%	50.0%
USDZAR	MLP	0.0020	0.0020	0.9600	0.0768	0.1251	91.7%	8.3%
USDZAR	LSTM	0.1602	0.3438	−0.3000	−0.0376	0.0032	33.3%	66.7%
XAUUSD	ARIMA	0.0020	0.0020	1.0000	0.1931	0.3431	91.7%	8.3%
XAUUSD	DT	0.0020	0.0020	1.0000	0.1071	0.1366	91.7%	8.3%
XAUUSD	LR	0.9219	0.7539	−0.0200	−0.0071	0.0069	41.7%	58.3%
XAUUSD	MLP	0.0059	0.0215	0.8100	0.0640	0.1514	83.3%	16.7%
XAUUSD	LSTM	0.0098	0.1094	0.4300	0.0069	0.0380	75.0%	25.0%

Note: Bold values indicate the best-performing model determined by the Bayesian probabilities for each FTS data set.

Appendix A.2. Profit Statistical Analysis

Table A2. Statistical comparison between the profit-optimal ensemble and baseline models across all datasets. Bayesian probabilities are expressed as percentages. Confidence intervals truncated to two decimal places.

Dataset	Baseline	Wilcoxon p	Sign p	Cliff’s $δ$	CI Low	CI High	Bayes Ens	Bayes Base
BCOUSD	ARIMA	0.0020	0.0020	1.0000	197.90	282.71	91.7%	8.3%
BCOUSD	DT	0.0020	0.0020	0.7800	69.82	113.40	91.7%	8.3%
BCOUSD	LR	0.0645	0.3438	0.0400	0.59	8.71	66.7%	33.3%
BCOUSD	MLP	0.0137	0.0215	0.4600	40.71	174.38	83.3%	16.7%
BCOUSD	LSTM	0.0059	0.0215	0.1800	7.59	24.18	83.3%	16.7%
ETXEUR	ARIMA	0.0059	0.0215	0.9200	714.89	1571.84	83.3%	16.7%
ETXEUR	DT	0.2324	0.7539	0.2400	−80.83	502.17	58.3%	41.7%
ETXEUR	LR	0.4922	0.7539	−0.0400	−84.11	18.61	41.7%	58.3%
ETXEUR	MLP	0.0039	0.0215	0.9000	574.52	1326.53	83.3%	16.7%
ETXEUR	LSTM	0.3750	1.0000	−0.1800	−451.80	127.66	50.0%	50.0%
JPXJPY	ARIMA	0.0020	0.0020	1.0000	15,461.43	24,066.83	91.7%	8.3%
JPXJPY	DT	0.0020	0.0020	0.6200	6176.70	11,569.86	91.7%	8.3%
JPXJPY	LR	0.0645	0.1094	0.1200	271.66	2650.30	75.0%	25.0%
JPXJPY	MLP	0.0020	0.0020	1.0000	13,353.45	22,951.74	91.7%	8.3%
JPXJPY	LSTM	0.0020	0.0020	0.5800	4556.76	9004.21	91.7%	8.3%
JPYZAR	ARIMA	0.0020	0.0020	1.0000	4.18	8.14	91.7%	8.3%
JPYZAR	DT	0.0273	0.1094	0.4200	0.38	2.94	75.0%	25.0%
JPYZAR	LR	0.3750	0.3438	−0.1400	−0.43	0.18	33.3%	66.7%
JPYZAR	MLP	0.0195	0.1094	0.7400	3.01	8.49	75.0%	25.0%
JPYZAR	LSTM	0.1309	0.1094	0.1400	−0.05	1.65	75.0%	25.0%
USDZAR	ARIMA	0.0020	0.0020	1.0000	9.51	16.79	91.7%	8.3%
USDZAR	DT	0.0039	0.0215	0.5400	2.80	7.03	83.3%	16.7%
USDZAR	LR	0.0488	0.1094	0.2800	0.45	3.51	75.0%	25.0%
USDZAR	MLP	0.0020	0.0020	0.9000	8.97	16.20	91.7%	8.3%
USDZAR	LSTM	0.0137	0.1094	0.4600	1.78	6.16	75.0%	25.0%
XAUUSD	ARIMA	0.0020	0.0020	1.0000	953.47	1333.26	91.7%	8.3%
XAUUSD	DT	0.0098	0.1094	0.2600	118.10	320.81	75.0%	25.0%
XAUUSD	LR	0.1934	0.3438	−0.0800	−68.03	4.82	33.3%	66.7%
XAUUSD	MLP	0.0059	0.0215	0.7600	432.88	1073.47	83.3%	16.7%
XAUUSD	LSTM	0.0488	0.1094	0.1600	15.86	162.87	75.0%	25.0%

Note: Bold values indicate the best-performing model determined by the Bayesian probabilities for each FTS data set.

References

Kamolov, S.; Iskhakov, D.; Ziyaev, B. Machine learning methods in time series forecasting: A review. Ann. Math. Comput. Sci. 2021, 2, 10–14.
Box, G. Science and statistics. J. Am. Stat. Assoc. 1976, 71, 791–799.
Hall, T.; Rasheed, K. A Survey of Machine Learning Methods for Time Series Prediction. Appl. Sci. 2025, 15, 5957.
Wang, Z.; Oates, T. Imaging Time-Series to Improve Classification and Imputation. In Proceedings of the 24th International Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2015; pp. 3939–3945.
Appelbaum, M.; Tsokos, C.P. A heuristic method for estimating time-series models for forecasting. Appl. Math. Comput. 1985, 16, 265–275.
Stefenon, S.F.; Ribeiro, M.H.D.M.; Nied, A.; Yow, K.; Mariani, V.C.; dos Santos Coelho, L.; Seman, L.O. Time series forecasting using ensemble learning methods for emergency prevention in hydroelectric power plants with dam. Electr. Power Syst. Res. 2022, 202, 107584.
Tang, Y.; Song, Z.; Zhu, Y.; Yuan, H.; Hou, M.; Ji, J.; Tang, C.; Li, J. A survey on machine learning models for financial time series forecasting. Neurocomputing 2022, 512, 363–380.
Wolpert, D.H.; Macready, W.G. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1997, 1, 67–82.
Mienye, D.; Sun, Y. A Survey of Ensemble Learning: Concepts, Algorithms, Applications, and Prospects. IEEE Access 2022, 10, 99129–99149.
Zhou, Z.H. Encyclopedia of Biometrics; Springer: Berlin/Heidelberg, Germany, 2009; Chapter Ensemble Learning; pp. 270–273.
Hospedales, T.M.; Antoniou, A.; Micaelli, P.; Storkey, A.J. Meta-Learning in Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5149–5169.
Soto, J.; Melin, P.; Castillo, O. A New Approach for Time Series Prediction Using Ensembles of IT2FNN Models with Optimization of Fuzzy Integrators. Int. J. Fuzzy Syst. 2018, 20, 701–728.
He, K.; Yang, Q.; Ji, L.; Pan, J.; Zou, Y. Financial Time Series Forecasting with the Deep Learning Ensemble Model. Mathematics 2023, 11, 1054.
Mohammed, A.; Kora, R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 757–774.
Sakib, M.; Mustajab, S.; Alam, M. Ensemble deep learning techniques for time series analysis: A comprehensive review, applications, open issues, challenges, and future directions. Clust. Comput. 2024, 28, 73.
Martínez-Muñoz, G.; Hernández-Lobato, D.; Suárez, A. An Analysis of Ensemble Pruning Techniques Based on Ordered Aggregation. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 245–259.
Ma, Z.; Dai, Q.; Liu, N. Several novel evaluation measures for rank-based ensemble pruning with applications to time series prediction. Expert Syst. Appl. 2015, 42, 280–292.
Rezk, S.; Selim, K. Metaheuristic-based ensemble learning: An extensive review of methods and applications. Neural Comput. Appl. 2024, 36, 17931–17959.
Zhang, G.P. A neural network ensemble method with jittered training data for time series forecasting. Inf. Sci. 2007, 177, 5329–5346.
Rahman, M.; Islam, M.; Murase, K.; Yao, X. Layered Ensemble Architecture for Time Series Forecasting. IEEE Trans. Cybern. 2015, 46, 270–283.
Oktora, S.; Kurnia, A. Hybrid Ensemble Method for Residual-Based Forecast of Time Series Data with Interventions. Model Assist. Stat. Appl. 2025, 20, 187–197.
Torgo, L.; Oliveira, M. Ensembles for Time Series Forecasting. In Proceedings of the Sixth Asian Conference on Machine Learning, Nha Trang, Vietnam, 26–28 November 2014; pp. 360–370.
Hoseinzade, E.; Haratizadeh, S. CNNpred: CNN-based stock market prediction using a diverse set of variables. Expert Syst. Appl. 2019, 129, 273–285.
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21); Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2020.
Bollerslev, T. Generalized autoregressive conditional heteroskedasticity. J. Econom. 1986, 31, 307–327.
Hyndman, R.; Athanasopoulos, G. Forecasting: Principles and Practice, 2nd ed.; OTexts: Melbourne, Australia, 2018; Chapter 8; Available online: https://otexts.com/fpp2/ (accessed on 31 August 2025).
Dickey, D.A.; Fuller, W.A. Distribution of the estimators for autoregressive time series with a unit root. J. Am. Stat. Assoc. 1979, 74, 427–431.
Kwiatkowski, D.; Phillips, P.C.B.; Schmidt, P.; Shin, Y. Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? J. Econom. 1992, 54, 159–178.
Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
Witsenhausen, H.; Wyner, A. A conditional entropy bound for a pair of discrete random variables. IEEE Trans. Inf. Theory 1975, 21, 493–501.
Tzannes, N.S.; Noonan, J.P. The mutual information principle and applications. Inf. Control 1973, 22, 1–12.
Huang, Y.; Zhao, Y.; Capstick, A.; Palermo, F.; Haddadi, H.; Barnaghi, P. Analyzing entropy features in time-series data for pattern recognition in neurological conditions. Artif. Intell. Med. 2024, 150, 102821.
Wilder, J.W. New Concepts in Technical Trading Systems; Trend Research: Abu Dhabi, United Arab Emirates, 1978; pp. 7–63.
McDonell, W. The FX Bootcamp Guide to Strategic and Tactical Forex Trading; Wiley: Hoboken, NJ, USA, 2008; Volume 334, Chapter Lagging Indicators.
Bühlmann, P. Bagging, Boosting and Ensemble Methods. In Handbook of Computational Statistics; Springer: Berlin/Heidelberg, Germany, 2012; pp. 985–1022.
Pavlyshenko, B. Using Stacking Approaches for Machine Learning Models. In Proceedings of the 2018 IEEE Second International Conference on Data Stream Mining and Processing; IEEE: New York, NY, USA, 2018; pp. 255–258.
Efron, B. Bootstrap Methods: Another Look at the Jackknife. Ann. Stat. 1979, 7, 1–26.
Härdle, W.K.; Horowitz, J.; Kreiss, J.P. Bootstrap Methods for Time Series. Int. Stat. Rev. 2003, 71, 435–459.
Künsch, H.R. The Jackknife and the Bootstrap for General Stationary Observations. Ann. Stat. 1989, 17, 1217–1241.
Politis, D.; Romano, J.P. Nonparametric Resampling for Homogeneous Strong Mixing Random Fields. J. Multivar. Anal. 1993, 47, 301–328.
Politis, D.; Romano, J.P. The Stationary Bootstrap. J. Am. Stat. Assoc. 1994, 89, 1303–1313.
Liu, R.Y.; Singh, K. Exploring the Limits of Bootstrap; Wiley: Hoboken, NJ, USA, 1992; Chapter Moving Blocks Jackknife and Bootstrap Capture Weak Dependence.
Bühlmann, P. Sieve Bootstrap for Time Series. Bernoulli Soc. Math. Stat. Probab. 1997, 3, 123–148.
Paparoditis, E.; Politis, D. Tapered Block Bootstrap. Biometrika 2001, 88, 1105–1119.
Talbi, E.G. Metaheuristics: From Design to Implementation; Wiley: Hoboken, NJ, USA, 2009; Chapter Common Concepts for Metaheuristics.
Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95—International Conference on Neural Networks; IEEE Service Center: Piscataway, NJ, USA, 1995; Volume 4, pp. 1942–1948.
Kirkpatrick, S.; Gelatt, C.D.; Vecchi, M.P. Optimization by Simulated Annealing. Science 1983, 220, 671–680.
Holland, J.H. Genetic Algorithms. Sci. Am. 1992, 267, 66–73.
Shi, Y.; Eberhart, R. A modified particle swarm optimizer. In Proceedings of the IEEE Conference on Evolutionary Computation; IEEE: New York, NY, USA, 1998; Volume 6, pp. 69–73.
Blackwell, T.; Bentley, P. Dynamic Search With Charged Swarms. In Proceedings of the Genetic and Evolutionary Computation Conference; Morgan Kaufmann: San Francisco, CA, USA, 2002; pp. 19–26.
Harrison, K.; Engelbrecht, A.; Ombuki-Berman, B. An adaptive particle swarm optimization algorithm based on optimal parameter regions. In Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence; IEEE: New York, NY, USA, 2017; pp. 1–8.
Poli, R.; Broomhead, D. Exact analysis of the sampling distribution for the canonical particle swarm optimiser and its convergence during stagnation. In Proceedings of the GECCO 2007: Genetic and Evolutionary Computation Conference; Association for Computing Machinery: New York, NY, USA, 2007; pp. 134–141.
Harrison, K.; Ombuki-Berman, B.; Engelbrecht, A. The Effect of Probability Distributions on the Performance of Quantum Particle Swarm Optimization for Solving Dynamic Optimization Problems. In Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence; IEEE: New York, NY, USA, 2015; pp. 242–250.
Ballings, M.; Van den Poel, D.; Hespeels, N.; Gryp, R. Evaluating multiple classifiers for stock price direction prediction. Expert Syst. Appl. 2015, 42, 7046–7056.
Fischer, T.; Krauss, C. Deep learning with long short-term memory networks for financial market predictions. Eur. J. Oper. Res. 2018, 270, 654–669.
Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice, 2nd ed.; OTexts: Melbourne, Australia, 2018; Chapter 6; Available online: https://otexts.com/fpp2/ (accessed on 31 August 2025).
Tsay, R.S. Analysis of Financial Time Series, 3rd ed.; Wiley: Hoboken, NJ, USA, 2010; Chapter 2.
Chakraborti, A.; Patriarca, M.; Santhanam, M. Financial Time-series Analysis: A Brief Overview. In Econophysics of Markets and Business Networks: Proceedings of the Econophys-Kolkata III; Springer: Milan, Italy, 2007; Volume 5, pp. 51–67.
Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2006; Chapter 2; pp. 13–72.
Murphy, J.J. Technical Analysis of the Financial Markets; New York Institute of Finance: New York, NY, USA, 1999; Chapter 1.
López de Prado, M. Advances in Financial Machine Learning; Wiley: Hoboken, NJ, USA, 2018; Chapter 4.
Murphy, J.J. Technical Analysis of the Financial Markets, 1st ed.; New York Institute of Finance: New York, NY, USA, 1999; Chapter 11; pp. 296–302.
Murphy, J.J. Technical Analysis of the Financial Markets, 1st ed.; New York Institute of Finance: New York, NY, USA, 1999; Chapter 10; pp. 281–285.
Murphy, J.J. Technical Analysis of the Financial Markets, 1st ed.; New York Institute of Finance: New York, NY, USA, 1999; Chapter 4; pp. 62–70.
Wilcoxon, F. Individual Comparisons by Ranking Methods. Biom. Bull. 1945, 1, 80–83.
Dixon, W.J.; Mood, A.M. The Statistical Sign Test. J. Am. Stat. Assoc. 1946, 41, 557–566.
Cliff, N. Dominance Statistics: Ordinal Analyses to Answer Ordinal Questions. Psychol. Bull. 1993, 114, 494–509.
Kruschke, J.K. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan, 2nd ed.; Academic Press: Amsterdam, The Netherlands, 2015; Chapter 9; pp. 261–300.
Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth International Group: Franklin, TN, USA, 1984.
Cox, D.R. The Regression Analysis of Binary Sequences. J. R. Stat. Soc. Ser. B 1958, 20, 215–242.
Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386–408.
Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Bishop, C.M. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, UK, 1995; Chapter 4; pp. 116–120.
Fan, R.E.; Chang, K.W.; Hsieh, C.J.; Wang, X.R.; Lin, C.J. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 2008, 9, 1871–1874.
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–13.
Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML); Omnipress: Madison, WI, USA, 2010; pp. 807–814.
Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C. Time Series Analysis: Forecasting and Control; Prentice Hall: Englewood Cliffs, NJ, USA, 1994; Chapter 2; pp. 29–72.
Cont, R. Empirical properties of asset returns: Stylized facts and statistical issues. Quant. Financ. 2001, 1, 223–236.
Hamilton, J.D. Time Series Analysis; Princeton University Press: Princeton, NJ, USA, 1994; Chapter 3; pp. 45–91.
Sun, J.; Xu, W.; Feng, W. A Global Search Strategy of Quantum-Behaved Particle Swarm Optimization. In Proceedings of the IEEE Conference on Cybernetics and Intelligent Systems; IEEE: New York, NY, USA, 2004; pp. 111–116.
Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012; Chapter 3; pp. 85–90.
Lo, A.W. The Adaptive Markets Hypothesis: Market Efficiency from an Evolutionary Perspective. J. Portf. Manag. 2004, 30, 15–29.
Dietterich, T.G. Ensemble Methods in Machine Learning. In Proceedings of the Multiple Classifier Systems; Kittler, J., Roli, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15.
Kuncheva, L.I.; Whitaker, C.J. That Elusive Diversity in Classifier Ensembles. Mach. Learn. 2003, 51, 181–207. [Google Scholar] [CrossRef]
Opitz, D.; Maclin, R. Popular Ensemble Methods: An Empirical Study. J. Artif. Intell. Res. 1999, 11, 169–198. [Google Scholar] [CrossRef]

Figure 1. Block bootstrap methods for time series data.

Figure 2. End-to-end empirical workflow for ensemble evaluation.

Figure 3. Daily difference curve for the USDZAR exchange rate with highlighted offset or do-nothing region.

Figure 4. RSI and ADX actionable regions developed by the authors.

Table 1. Definitions of Stationarity.

Type of Stationarity	Definition
Stationary process	A process that generates a stationary series of observations.
Stationary model	A model that describes a stationary series of observations.
Trend stationary	A time series that does not exhibit a trend.
Seasonal stationary	A time series that does not exhibit seasonality.
Strictly stationary	The statistical properties of a time series such as the mean, variance and covariance do not vary with time or are invariant to a time shift.
First-order stationary	A time series whose mean does not vary with time. Other statistical properties, such as the variance, can change with time.
Second-order stationary	Also called weakly stationary. The time series has a constant mean and variance, and an auto-covariance that is invariant to a shift in time. Other statistical properties can change with time. This constrained version of strict stationarity is very common in practice.

Table 2. Types of leading and lagging indicators.

FTS Indicator	Type	Definition
Momentum	Leading	Rate of change in an FTS to identify stages of momentum.
Average directional index (ADX)	Leading	Measures the strength of a trend. An ADX greater than 25 signifies a strong trend.
Average true range (ATR)	Leading	High–low range of an FTS within any time interval.
Relative strength index (RSI)	Leading	Ratio of upward to downward moves in an FTS over a fixed interval (0–100). Values above 70 indicate “overbought” and below 30 indicate “oversold,” both suggesting trend reversals.
Stochastic oscillator	Leading	Momentum indicator showing the placement of the last value relative to the high–low range over a fixed interval.
Simple moving average (SMA)	Lagging	Average of a series over a specified time period.
Exponential moving average (EMA)	Lagging	Average of a series with exponential smoothing applied.
Bollinger Bands	Lagging	A moving average with upper and lower bounds typically two standard deviations apart.
Moving average convergence divergence (MACD)	Lagging	Momentum indicator based on moving averages. Convergence indicates weakening trends; divergence indicates strengthening trends.
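
The definitions in Table 2 can be made concrete with a minimal sketch of three of the indicators. It is illustrative only and is not the authors' implementation: the function names, the EMA smoothing factor 2 / (window + 1), the use of simple averages of gains and losses in the RSI, the value returned for a flat series, and the example prices are all assumptions.

```python
from typing import List, Optional


def sma(values: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until a full window is available."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def ema(values: List[float], window: int) -> List[float]:
    """Exponential moving average, seeded with the first value.

    Assumption: smoothing factor alpha = 2 / (window + 1).
    """
    alpha = 2.0 / (window + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def rsi(values: List[float], window: int) -> float:
    """RSI (0-100) over the last `window` changes.

    Assumption: gains and losses are summed without smoothing, so the
    ratio of sums equals the ratio of simple averages.
    """
    changes = [b - a for a, b in zip(values[:-1], values[1:])][-window:]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if losses == 0:
        # No downward moves: 100 if there were gains, else a neutral 50 (assumed convention).
        return 100.0 if gains > 0 else 50.0
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)


# Hypothetical closing prices, for illustration only.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]
sma3 = sma(prices, 3)
ema3 = ema(prices, 3)
rsi5 = rsi(prices, 5)
```

With these prices the RSI exceeds 70, which Table 2 describes as the "overbought" region.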

Table 3. Comparative summary of representative studies on financial time series modeling. Abbreviations: ECON, econometrics; ML, machine learning; DL, deep learning; ENS, ensemble learning; META, metaheuristic optimization; TS, time series; FTS, financial time series; F, forecasting; C, classification.

Study	Domain	Methods	Data	Task	Key Findings	Relevance
Soto et al. [12]	ENS and ML	Fuzzy neural network ENS	TS	F	Optimizing fuzzy integrators improves ENS accuracy	Shows importance of ENS component optimization
He et al. [13]	DL and ENS	DL ENS	FTS	F	DL ENS outperform individual DL models	Supports DL baselines and ENS relevance
Mohammed & Kora [14]	ENS and DL	Review of ENS–DL methods	Multiple	F, C	Identifies factors influencing ENS success	Supports multi-dimensional ENS design
Sakib et al. [15]	ENS and DL	Review of ENS–DL techniques	TS	F	Highlights challenges in ENS–DL	Motivates systematic ENS evaluation
Gonzalo et al. [16]	ML and ENS	ENS pruning	TS	C	Pruning improves ENS efficiency	Relevant to voting and ENS size
Zhongchen et al. [17]	ML and ENS	Rank-based pruning	TS	F	New pruning metrics improve selection	Relevant to ENS evaluation
Rezk & Selim [18]	ENS and META	META weighting strategies	Multiple	F, C	META improves ENS weighting	Supports ENS weight optimization using PSO and QPSO
Zhang [19]	DL and ENS	Neural network ENS with jittered bootstrap	TS	F	Noise-based bootstrap improves diversity	Relevant to bootstrap diversity
Rahman et al. [20]	DL and ENS	Layered ENS	TS	F	Layered ENS improve accuracy	Relevant to ENS architecture
Oktora & Kurnia [21]	ECON and ENS	Hybrid residual-based ENS	TS	F	Hybrid ENS improve residual modeling	Illustrates hybrid ECON–ENS approaches
Box & Jenkins [2]	ECON	ARIMA	TS	F	Foundational linear TS model	Included as ECON baseline
Bollerslev [25]	ECON	GARCH	FTS	F	Models volatility clustering	Supports volatility modeling relevance
Ballings et al. [54]	ML	Comparison of common ML classifiers	FTS	C	Classifier performance varies across datasets	Justifies ML baselines
Fischer & Krauss [55]	DL	LSTM	FTS	C	LSTM outperforms random forest and LR	Justifies DL baseline
Zhou et al. [24]	DL	Informer (Transformer)	TS	F	Efficient long-range modeling	Supports acknowledging Transformer models
Present study	ML and DL and ECON	ENS using standard ML, DL, and ECON baselines	FTS	C	Unified evaluation across domains	Provides end-to-end ENS design assessment

Note: Bold values highlight the contribution of the present study.

Table 4. FTS data sets used in this study.

Data Set	Date Range	Observations	One-Day Samples	Features	Asset Class
BCOUSD	15 November 2010 to 29 December 2023	4,930,561	3425	5	Commodity
ETXEUR	15 November 2010 to 1 February 2019	3,087,361	2145	5	Stock
JPXJPY	15 November 2010 to 29 December 2023	4,930,561	3425	5	Stock
JPYZAR	15 November 2010 to 29 December 2023	4,930,561	3425	5	Exchange rate
USDZAR	15 November 2010 to 29 December 2023	4,930,561	3425	5	Exchange rate
XAUUSD	16 March 2009 to 29 December 2023	5,556,961	3860	5	Commodity

Table 5. Stationarity results using the ADF and KPSS tests.

Data Set	One-Day Samples	Full Data Set
BCOUSD	Weakly stationary	Weakly stationary
ETXEUR	Weakly stationary	Weakly stationary
JPXJPY	Weakly stationary	Weakly stationary
JPYZAR	Weakly stationary	Weakly stationary
USDZAR	Weakly stationary	Weakly stationary
XAUUSD	Weakly stationary	Weakly stationary

Table 6. Entropy results from the entropy, conditional entropy (CE), and mutual information (MI) tests.

Data Set	Block Entropy	Conditional Entropy	Mutual Information
BCOUSD	11.2 units	<2.1 units across all features	11.0 units
ETXEUR	8.8 units	<1.2 units across all features	8.8 units
JPXJPY	11.3 units	<1.6 units across all features	11.0 units
JPYZAR	11.1 units	<2.2 units across all features	11.0 units
USDZAR	11.4 units	<1.8 units across all features	11.0 units
XAUUSD	11.5 units	<1.6 units across all features	11.0 units
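
The three quantities reported in Table 6 are linked by standard identities, which the following sketch illustrates on discrete symbol sequences. It is not the authors' procedure: the discretization of the FTS, the block construction, and the logarithm base behind the "units" in Table 6 are not given here, so the sketch assumes already-discretized symbols and measures entropy in bits. All names and example sequences are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(symbols: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """H(X | Y) = H(X, Y) - H(Y), with the joint taken over paired symbols."""
    return entropy(list(zip(x, y))) - entropy(y)


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """I(X; Y) = H(X) - H(X | Y)."""
    return entropy(x) - conditional_entropy(x, y)


# Hypothetical symbol sequences: y fully determines x in the first pair,
# and is independent of x in the second.
x_dep = ["u", "d", "u", "d"]
y_dep = ["a", "b", "a", "b"]
x_ind = ["u", "u", "d", "d"]
y_ind = ["a", "b", "a", "b"]

mi_dependent = mutual_information(x_dep, y_dep)
mi_independent = mutual_information(x_ind, y_ind)
```

A low conditional entropy together with a mutual information close to the entropy itself, as in Table 6, corresponds to the first case: knowing one variable removes most of the uncertainty about the other.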

Table 7. Offset values and class distributions per data set.

Data Set	Offset	Buy	Sell	Do-Nothing
BCOUSD	5 US dollars	1585 (46%)	1210 (35%)	630 (18%)
ETXEUR	50 euros	786 (37%)	136 (6%)	1223 (57%)
JPXJPY	50 Japanese yen	1603 (47%)	724 (21%)	1098 (32%)
JPYZAR	2.5 South African cents	1222 (36%)	396 (12%)	1807 (53%)
USDZAR	2.5 South African cents	1317 (38%)	810 (24%)	1298 (38%)
XAUUSD	5 US dollars	1912 (50%)	437 (11%)	1511 (39%)
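
One way to read the offsets in Table 7, together with the do-nothing region highlighted in Figure 3, is as a symmetric band around zero on the daily difference. The sketch below follows that reading and is a set of assumptions rather than the authors' labeling rule: the difference is taken as the next close minus the current close, a rise above the offset is labeled buy, a fall below the negative offset is labeled sell, and the closing prices are invented for illustration.

```python
from collections import Counter
from typing import List


def label_by_offset(closes: List[float], offset: float) -> List[str]:
    """Label each day from the next-day difference and a symmetric offset band.

    Assumed rule: diff > offset -> "buy", diff < -offset -> "sell",
    otherwise "do-nothing".
    """
    labels = []
    for today, tomorrow in zip(closes[:-1], closes[1:]):
        diff = tomorrow - today
        if diff > offset:
            labels.append("buy")
        elif diff < -offset:
            labels.append("sell")
        else:
            labels.append("do-nothing")
    return labels


# Hypothetical USDZAR closes in rand; 2.5 South African cents = 0.025 rand.
closes = [18.00, 18.05, 18.04, 18.00, 18.10]
labels = label_by_offset(closes, offset=0.025)
class_counts = Counter(labels)
```

Counting the labels, as in `class_counts`, gives the kind of class distribution reported per data set in Table 7.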

Table 8. Model parameters for ensemble base learners and standalone baseline models.

Model	Description of Key Control Parameters
Decision tree	Uses the `gini` criterion [69] for node splitting.
Logistic regression	Employs the `liblinear` solver [76] for parameter optimization.
Multi-layer perceptron	Utilizes the `adam` solver [77], the `relu` activation function [78], and a hidden layer of 160 units.
ARIMA	Order $(p, 1, q)$ selected using an AIC-based grid search over $p, q \in \{0, 1, 2\}$ [79]; fitted to differenced daily closing prices using `statsmodels`.
LSTM	Single LSTM layer with 32 units, followed by a dense softmax output layer; trained using the `adam` optimizer (learning rate 0.001) [77] and sparse categorical cross-entropy loss.

Table 9. Self-Adaptive PSO and QPSO algorithm control parameters.

Algorithm	Parameters
Equal Weighted Majority	None
Self-Adaptive PSO	k = 5, iterations = 100, particles = 30
Self-Adaptive QPSO	k = 5, d = `uniform`, r = 0.5, quantum particles = 15, iterations = 100, particles = 15
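
Table 9 lists the control parameters of the optimizers that search for voting weights. The sketch below shows only the voting step in which such weights would be used, with equal-weight majority voting as the special case of unit weights. It is a simplified illustration under stated assumptions: the function names, the tie-breaking rule, and the example weights are hypothetical, and the PSO or QPSO search itself is not shown.

```python
from typing import Dict, Sequence


def weighted_vote(predictions: Sequence[str], weights: Sequence[float]) -> str:
    """Return the class with the largest total weight over ensemble members."""
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    totals: Dict[str, float] = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    # Assumed tie-break: the first-seen class wins, because dicts keep
    # insertion order and max() returns the first maximal key.
    return max(totals, key=lambda label: totals[label])


def majority_vote(predictions: Sequence[str]) -> str:
    """Equal-weight majority voting: every member has weight 1."""
    return weighted_vote(predictions, [1.0] * len(predictions))


# Hypothetical predictions from five ensemble members for one sample.
member_predictions = ["buy", "sell", "sell", "do-nothing", "sell"]
# Hypothetical weights, standing in for the output of a weight optimizer.
member_weights = [0.9, 0.1, 0.1, 0.2, 0.1]

equal_weight_result = majority_vote(member_predictions)
optimized_result = weighted_vote(member_predictions, member_weights)
```

In this example the equal-weight vote selects "sell", while the weighted vote selects "buy" because one member carries most of the weight, which shows how optimized weights can change the ensemble decision.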

Table 10. Average accuracy with standard deviation for each model and dataset.

	DT	LR	MLP
BCOUSD	58.0 ± 0.0525%	62.1 ± 0.0434%	62.4 ± 0.0456%
ETXEUR	57.6 ± 0.0385%	58.5 ± 0.0286%	54.1 ± 0.0879%
JPXJPY	48.1 ± 0.0380%	49.9 ± 0.0311%	48.6 ± 0.0290%
JPYZAR	56.0 ± 0.0640%	58.5 ± 0.0506%	52.5 ± 0.0939%
USDZAR	47.3 ± 0.0363%	49.5 ± 0.0261%	47.9 ± 0.0453%
XAUUSD	59.0 ± 0.0492%	63.3 ± 0.0313%	56.0 ± 0.0630%

Note: Bold values indicate the best-performing model for each FTS data set.

Table 11. Average profit with standard deviation for each model and dataset.

	DT	LR	MLP
BCOUSD	179.26 ± 63.17	214.34 ± 73.37	212.91 ± 76.16
ETXEUR	1015.65 ± 376.35	1125.07 ± 452.11	1049.07 ± 673.73
JPXJPY	17,074.44 ± 8380.18	19,630.83 ± 7431.40	19,473.18 ± 10,196.74
JPYZAR	6.01 ± 1.64	6.26 ± 2.23	4.95 ± 3.15
USDZAR	10.65 ± 3.19	11.47 ± 3.22	12.19 ± 4.21
XAUUSD	1059.79 ± 372.02	1157.67 ± 354.93	769.77 ± 556.79

Note: Bold values indicate the best-performing model for each FTS data set.

Table 12. Best bootstrap method and voting mechanism performance per data set.

Data Set	Bootstrap Method		Voting Mechanism
Data Set	Accuracy	Profit	Accuracy	Profit
BCOUSD	Tukey	Tukey	PSO	PSO
ETXEUR	Stationary	Tukey	PSO	QPSO
JPXJPY	Tukey	Tukey	Majority	PSO
JPYZAR	Stationary	Tukey	Majority	PSO
USDZAR	Tukey	Tukey	PSO	QPSO
XAUUSD	Tukey	Moving Block	PSO	PSO

Table 13. Best ensemble size and loss function performance per data set.

Data Set	Ensemble Size		Loss Function
Data Set	Accuracy	Profit	Accuracy	Profit
BCOUSD	100	100	entropy	entropy
ETXEUR	50	30	entropy	profit
JPXJPY	150	50	entropy	profit
JPYZAR	50	50	entropy	profit
USDZAR	50	150	entropy	profit
XAUUSD	50	50	entropy	profit

Table 14. Best bootstrap method performance per model.

Bootstrap Method	Metric: Accuracy			Metric: Profit
Bootstrap Method	DT	LR	MLP	DT	LR	MLP
Block	0	1	0	0	2	0
Block Sieve	0	0	0	0	0	0
Moving Block	0	0	0	0	0	1
Stationary	2	0	1	2	0	0
Subsample	0	4	0	0	2	1
Tukey	4	1	5	4	2	4
Total	6	6	6	6	6	6

Note: Bold values indicate the best-performing model per bootstrap method across all FTS data sets.

Table 15. Best bootstrap method performance per voting mechanism.

Bootstrap Method	Metric: Accuracy			Metric: Profit
Bootstrap Method	Majority	PSO	QPSO	Majority	PSO	QPSO
Block	3	1	2	1	3	2
Block Sieve	1	1	4	0	1	5
Moving Block	0	0	6	0	0	6
Stationary	4	1	1	2	2	2
Subsample	3	3	0	1	5	0
Tukey	4	2	0	0	4	2
Total	15	8	13	4	15	17

Note: Bold values indicate the best-performing bootstrap method per voting mechanism across all FTS data sets.

Table 16. Best bootstrap method performance per ensemble size.

Bootstrap Method	Metric: Accuracy					Metric: Profit
Bootstrap Method	10	30	50	100	150	10	30	50	100	150
Block	0	0	0	0	0	0	0	0	0	0
Block Sieve	0	0	0	0	0	0	0	0	0	0
Moving Block	0	0	0	0	0	0	0	0	1	1
Stationary	0	2	2	2	3	0	1	1	2	1
Subsample	1	0	0	0	0	1	0	0	0	0
Tukey	5	4	4	4	3	5	5	5	3	4
Total	6	6	6	6	6	6	6	6	6	6

Note: Bold values indicate the best-performing bootstrap method per ensemble size across all FTS data sets.

Table 17. Best model performance per voting mechanism.

Model	Metric: Accuracy			Metric: Profit
Model	Majority	PSO	QPSO	Majority	PSO	QPSO
DT	5	1	0	1	4	1
LR	1	0	5	0	1	5
MLP	0	2	4	0	1	5
Total	6	3	9	1	6	11

Note: Bold values indicate the best-performing model per voting mechanism across all FTS data sets.

Table 18. Best ensemble size per voting mechanism.

Ensemble Size	Metric: Accuracy			Metric: Profit
Ensemble Size	Majority	PSO	QPSO	Majority	PSO	QPSO
10	5	1	0	0	2	4
30	3	3	0	0	4	2
50	2	4	0	1	4	1
100	1	3	2	0	4	2
150	1	4	1	1	3	2
Total	12	15	3	2	17	11

Note: Bold values indicate the best-performing ensemble size per voting mechanism across all FTS data sets.

Table 19. Best loss function performance per voting mechanism.

Voting Mechanism	Metric: Accuracy		Metric: Profit
Voting Mechanism	Entropy	Profit	Entropy	Profit
PSO	5	1	0	6
QPSO	6	0	1	5
Total	11	1	1	11

Note: Bold values indicate the best-performing loss function per voting mechanism across all FTS data sets.

Table 20. Best model performance per loss function.

Model	Metric: Accuracy		Metric: Profit
Model	Entropy	Profit	Entropy	Profit
DT	6	0	4	2
LR	5	1	0	6
MLP	4	2	0	6
Total	15	3	4	14

Note: Bold values indicate the best-performing model per loss function across all FTS data sets.

Table 21. Best loss function performance per bootstrap method.

Bootstrap Method	Metric: Accuracy		Metric: Profit
Bootstrap Method	Entropy	Profit	Entropy	Profit
Block	4	2	3	3
Block Sieve	4	2	0	6
Moving Block	3	3	0	6
Stationary	3	3	1	5
Subsample	6	0	4	2
Tukey	3	3	0	6
Total	23	13	8	28

Note: Bold values indicate the best-performing loss function per bootstrap method across all FTS data sets.

Table 22. Best number of ensemble members per loss function.

Ensemble Size	Metric: Accuracy		Metric: Profit
Ensemble Size	Entropy	Profit	Entropy	Profit
10	5	1	0	6
30	6	0	2	4
50	5	1	2	4
100	3	3	2	4
150	3	3	2	4
Total	22	8	8	22

Note: Bold values indicate the best-performing ensemble size per loss function across all FTS data sets.

Table 23. Best model performance per ensemble size.

Model	Metric: Accuracy					Metric: Profit
Model	10	30	50	100	150	10	30	50	100	150
DT	0	0	0	0	0	0	0	0	0	0
LR	6	6	6	5	5	3	3	3	4	4
MLP	0	0	0	1	1	3	3	3	2	2
Total	6	6	6	6	6	6	6	6	6	6

Note: Bold values indicate the best-performing model per ensemble size across all FTS data sets.

Table 24. Best-performing ensemble configurations for accuracy and profit across all datasets.

Dataset	Metric	Model	Bootstrap	Voting	Members	Loss
BCOUSD	Accuracy	LR	Tukey	PSO	100	Entropy
BCOUSD	Profit	LR	Tukey	PSO	100	Profit
ETXEUR	Accuracy	LR	Stationary	PSO	50	Entropy
ETXEUR	Profit	LR	Tukey	QPSO	30	Profit
JPXJPY	Accuracy	LR	Tukey	Majority	150	Entropy
JPXJPY	Profit	LR	Tukey	PSO	50	Profit
JPYZAR	Accuracy	LR	Stationary	Majority	50	Entropy
JPYZAR	Profit	LR	Tukey	PSO	50	Profit
USDZAR	Accuracy	LR	Tukey	PSO	50	Entropy
USDZAR	Profit	MLP	Tukey	QPSO	150	Profit
XAUUSD	Accuracy	LR	Tukey	PSO	50	Entropy
XAUUSD	Profit	LR	MBB	PSO	50	Profit

Table 25. Summary of ensemble performance relative to baseline models for each dataset based on accuracy. “Winner” reflects the model with the highest Bayesian posterior probability. “Strength” refers to the magnitude of Bayesian evidence.

Dataset	Winner	Strength	Comment
BCOUSD	Ensemble	Strong to very strong	Ensemble outperforms all baselines with consistently positive confidence intervals and moderate to large effect sizes.
ETXEUR	Ensemble	Strong	Ensemble dominates most baselines; logistic regression shows a slight advantage but with weak evidence and CI crossing zero.
JPXJPY	Ensemble	Strong	Ensemble outperforms most baselines; LSTM shows a small baseline advantage with weak evidence.
JPYZAR	Ensemble	Strong	Ensemble wins against most baselines; logistic regression shows a modest baseline advantage but with small effect size.
USDZAR	Ensemble	Strong to very strong	Ensemble dominates ARIMA, DT, and MLP; LSTM shows a moderate baseline advantage with small effect size.
XAUUSD	Ensemble	Strong	Ensemble outperforms most baselines; logistic regression shows moderate baseline advantage but with CI crossing zero.

Table 26. Summary of ensemble performance relative to baseline models for each dataset based on profit. “Winner” reflects the model with the highest Bayesian posterior probability. “Strength” refers to the magnitude of Bayesian evidence.

Dataset	Winner	Strength	Comment
BCOUSD	Ensemble	Very strong	Ensemble consistently outperforms all baselines with large positive profit differences and strong effect sizes.
ETXEUR	Ensemble	Strong	Ensemble dominates most baselines; logistic regression shows a slight advantage but with weak evidence and confidence intervals crossing zero.
JPXJPY	Ensemble	Very strong	Ensemble profit is dramatically higher than that of every baseline; extremely large confidence intervals and effect sizes indicate decisive superiority.
JPYZAR	Ensemble	Strong	Ensemble outperforms most baselines; logistic regression shows a modest baseline advantage but with small effect size and mixed evidence.
USDZAR	Ensemble	Very strong	Ensemble profit is consistently higher with strong to very strong Bayesian evidence and positive confidence intervals.
XAUUSD	Ensemble	Strong	Ensemble outperforms most baselines; logistic regression shows moderate baseline advantage but with small effect size and CI crossing zero.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nannoolal, A.; Engelbrecht, A.P. Ensemble Approach for Financial Time Series Modeling. Algorithms 2026, 19, 404. https://doi.org/10.3390/a19050404

