1. Introduction
The rapid advancement of machine learning (ML) has made it an indispensable tool for solving complex problems across diverse fields, from finance (
Gu et al., 2020) to healthcare (
Russell & Norvig, 2010). However, the reliability of ML models is frequently undermined by a persistent challenge: overfitting. Overfitting occurs when a model learns the noise and idiosyncratic details of its training data, rather than the underlying pattern. This leads to a failure to generalize, where the model performs well on training data but poorly on new, unseen data (
Geman et al., 1992).
To combat this issue, a variety of techniques have been developed. Regularization methods, such as Ridge (L2) and Lasso (L1) regression, penalize model complexity by adding a penalty term to the loss function based on the magnitude of the model’s coefficients (
Hoerl & Kennard, 1970;
Tibshirani, 1996). Other common approaches include cross-validation, which assesses generalization by partitioning data into training and validation sets (
Stone, 1974), and dropout, which randomly deactivates neurons during training to prevent co-adaptation in neural networks (
Srivastava et al., 2014).
In quantitative finance, overfitting is exacerbated by optimizer-driven searches over strategies, feature variants, and parameterizations. Selecting the best backtest among many candidates is a multiple-testing/data-snooping problem that inflates apparent statistical significance (
Hansen, 2005;
Harvey et al., 2016;
Sullivan et al., 1999;
White, 2000). Moreover, statistical inference for risk-adjusted metrics (e.g., Sharpe Ratio) requires care under the fat tails and dependence that characterize financial returns (
Cont, 2001;
Ledoit & Wolf, 2008;
Lo, 2002). Recent work further highlights the role of selection bias and proposes adjusted performance statistics and diagnostics for backtest overfitting (
Bailey et al., 2016;
Bailey & López de Prado, 2014).
This research addresses this gap by proposing the GT-Score (Golden Ticket Score), a composite objective function that embeds anti-overfitting principles directly into the optimization objective (
Sheppert, 2025). The GT-Score combines multiple facets of a desirable model, including performance, statistical significance, consistency, and downside risk, into a single objective function. By optimizing for this composite score, the optimizer is guided to discard spurious patterns and favor solutions that are more robust and more likely to generalize out of sample.
2. Materials and Methods
The methodology was designed to benchmark the GT-Score against common objective functions and to evaluate whether it offers a practical reduction in overfitting under standard backtesting procedures.
2.1. Dataset and Experimental Environment
The primary dataset consisted of historical daily Open, High, Low, Close, Volume (OHLCV) price data for the top 50 S&P 500 companies by market capitalization, sourced via the yFinance API (
Aroussi, 2020). The data spans January 2010 through December 2024, providing approximately 3770 trading days per asset and covering multiple market regimes (including the post-2008 recovery, the 2020 COVID-19 crash, and the 2022 interest rate environment).
The entire experimental framework was implemented in Python 3.10 using a custom backtesting engine. The code is provided as
Supplementary Material for reproducibility.
2.2. Trading Strategies
Three well-established technical trading strategies were employed to test the optimization frameworks:
Relative Strength Index (RSI): A momentum oscillator with optimizable overbought/oversold thresholds (
Wilder, 1978).
Moving Average Convergence Divergence (MACD): A trend-following strategy with optimizable fast, slow, and signal periods (
Appel, 1979).
Bollinger Bands: A mean-reversion strategy with an optimizable lookback window and band width (
Bollinger, 2001).
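To illustrate the kind of indicator these strategies parameterize, the following is a minimal, generic sketch of Wilder's RSI; the overbought/oversold thresholds a strategy would then trade against are the optimizable parameters. This is a textbook implementation, not the study's exact code:

```python
def rsi(closes, period=14):
    """Wilder's RSI: 100 - 100/(1 + RS), where RS is the ratio of
    smoothed average gains to smoothed average losses."""
    gains = losses = 0.0
    # Seed the averages from the first `period` price changes.
    for prev, cur in zip(closes, closes[1:period + 1]):
        change = cur - prev
        gains += max(change, 0.0)
        losses += max(-change, 0.0)
    avg_gain, avg_loss = gains / period, losses / period
    values = []
    # Wilder smoothing for the remaining bars.
    for prev, cur in zip(closes[period:], closes[period + 1:]):
        change = cur - prev
        avg_gain = (avg_gain * (period - 1) + max(change, 0.0)) / period
        avg_loss = (avg_loss * (period - 1) + max(-change, 0.0)) / period
        rs = avg_gain / avg_loss if avg_loss else float("inf")
        values.append(100.0 - 100.0 / (1.0 + rs))
    return values
```

A strategy would then, for example, go long when RSI falls below the oversold threshold and exit above the overbought threshold.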
2.3. Walk-Forward Validation
To assess out-of-sample performance in a realistic time-series setting, we employed walk-forward validation (
López de Prado, 2018;
Pardo, 1992). This produced nine sequential train/validation splits per asset, covering validation periods from 2014 through 2024.
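Split generation can be sketched as follows; the rolling 4-year-train/1-year-validation windows here are illustrative assumptions and do not reproduce the paper's exact nine split boundaries:

```python
def walk_forward_splits(first_year, last_year, train_len=4, val_len=1):
    """Rolling train/validation splits over calendar years.
    Window lengths are illustrative, not the study's configuration."""
    splits = []
    train_start = first_year
    while train_start + train_len + val_len - 1 <= last_year:
        train = (train_start, train_start + train_len - 1)
        val = (train_start + train_len, train_start + train_len + val_len - 1)
        splits.append((train, val))
        train_start += val_len  # advance by one validation step
    return splits

for train, val in walk_forward_splits(2010, 2024):
    print(f"train {train[0]}-{train[1]}  validate {val[0]}-{val[1]}")
```

Each model is optimized only on its training window and then scored on the immediately following, strictly out-of-sample validation window.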
2.4. Monte Carlo Analysis
To assess stability and reduce dependence on random initialization, each configuration was run 15 times with different random seeds (42–56), yielding 9000 total optimization trials for the Monte Carlo study.
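The trial bookkeeping is straightforward; `run_trial` below is a hypothetical callback standing in for one optimization run:

```python
SEEDS = range(42, 57)  # 15 random seeds, 42-56 inclusive

def monte_carlo(run_trial, configs):
    """Run every configuration once per seed; `run_trial` is a
    hypothetical callback performing one optimization trial."""
    return [run_trial(config, seed) for config in configs for seed in SEEDS]

# 50 assets x 4 objective functions x 3 strategies = 600 configurations,
# so 600 x 15 seeds = 9000 trials, matching the study's total.
trials = monte_carlo(lambda config, seed: (config, seed), range(600))
```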
2.5. Optimization Framework
Random search was employed as the optimization method with 25 parameter evaluations per trial. This method, while simple, has been shown to be competitive with more sophisticated approaches for moderate-dimensional parameter spaces (
Bergstra & Bengio, 2012). All objective functions were allocated the same evaluation budget to ensure a fair comparison. The author’s dissertation (
Sheppert, 2025) additionally tested Bayesian optimization (TPE) and genetic algorithms, finding consistent conclusions across optimization paradigms.
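A minimal random-search loop under a fixed evaluation budget might look like the following; the parameter space and toy objective are illustrative stand-ins, not the study's strategies:

```python
import random

def random_search(objective, param_space, n_evals=25, seed=42):
    """Sample n_evals parameterizations uniformly at random and keep
    the one with the highest objective value."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_evals):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy RSI-style threshold space (purely illustrative values).
space = {"oversold": range(10, 41), "overbought": range(60, 91)}
toy_objective = lambda p: -(p["oversold"] - 30) ** 2 - (p["overbought"] - 70) ** 2
best, score = random_search(toy_objective, space, n_evals=25)
```

Because every objective function receives the same `n_evals` budget and seeding scheme, differences in outcomes are attributable to the objective rather than the search effort.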
2.6. Comparative Objective Functions
The performance of the GT-Score was compared against three conventional objective functions, among them simple profit maximization and downside-risk-adjusted (Sortino) return.
3. Theory and Calculation
3.1. The GT-Score Formulation
The GT-Score is a composite objective function designed to unify performance with measures of robustness and statistical validity. Its mathematical formulation is

GT-Score = (μ_s − μ_b) · ln(z) · R²   (1)

The components are defined as follows:

μ_s: The mean strategy return per observation.
μ_b: The mean benchmark return (buy-and-hold) per observation.
σ: The standard deviation of strategy returns.
N: The number of return observations used to compute μ_s and σ.
z: A Z-score measuring statistical significance of outperformance:

z = (μ_s − μ_b) / (σ / √N)   (2)

ln(z): The natural logarithm of the Z-score, acting as a significance gate.
R²: The R-squared value measuring consistency of returns.
Equation (2) is a standardized excess-mean statistic (often treated as a t-statistic when σ is estimated from the same sample). Its probabilistic interpretation relies on an approximately Gaussian sampling distribution for the mean under i.i.d. assumptions and/or large-N asymptotics. In practice, trade-level returns can be fat-tailed, heteroskedastic, and autocorrelated (Cont, 2001), which reduces the effective sample size and can miscalibrate this parametric “significance” filter. In this paper, we therefore treat z primarily as a heuristic gating term for optimization rather than as an exact hypothesis test; more robust non-parametric and dependence-aware alternatives are discussed in the Discussion.
3.2. Edge Case Handling
To ensure numerical stability and economically meaningful behavior, the GT-Score employs a piecewise definition based on the z-score value (Algorithm 1). A small smoothing parameter ε prevents division by zero in the Z-score denominator when σ = 0 (e.g., a window of constant returns).
3.3. Theoretical Justification
The multiplicative structure of the GT-Score can be understood as a composite utility function where each component acts as a filter:
The ln(z) term acts as a significance gate, rejecting strategies that do not outperform the benchmark beyond sampling noise. Using ln(z) instead of z compresses large values so the significance term does not dominate the composite score, and it anchors the score at zero at z = 1, where ln(z) = 0 (Algorithm 1).
The R² term penalizes erratic performance that relies on outlier trades, promoting consistency (Kestner, 1996).
| Algorithm 1 GT-Score Piecewise Definition |
| 1: if z ≤ 0 then | ▹ Underperforms benchmark |
| 2: GT ← −P | ▹ Large penalty (P a large constant) |
| 3: else if 0 < z ≤ 1 then | ▹ Marginal outperformance |
| 4: GT ← (μ_s − μ_b) · R² · (z − 1) | ▹ Smooth transition |
| 5: else | ▹ Outperforms benchmark (z > 1) |
| 6: GT ← (μ_s − μ_b) · ln(z) · R² | ▹ Standard GT-Score |
| 7: end if | |
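The scoring logic can be sketched in Python. Here GT = (μ_s − μ_b)·ln(z)·R² with z = (μ_s − μ_b)/(σ/√N); the penalty constant, the ε term, and the linear form of the marginal transition are stated assumptions where the text elides exact values:

```python
import math
import statistics

def gt_score(strategy_returns, benchmark_returns, r_squared,
             min_trades=50, penalty=-1e6, eps=1e-12):
    """Sketch of the piecewise GT-Score. The penalty constant,
    epsilon, and the linear (z - 1) transition on 0 < z <= 1 are
    illustrative assumptions, not the paper's exact constants."""
    n = len(strategy_returns)
    if n < min_trades:
        return penalty  # minimum-sample gate (Section 3.4)
    mu_s = statistics.fmean(strategy_returns)
    mu_b = statistics.fmean(benchmark_returns)
    sigma = statistics.stdev(strategy_returns)
    z = (mu_s - mu_b) / (sigma / math.sqrt(n) + eps)  # Z-score with smoothing
    if z <= 0:
        return penalty                                 # underperforms benchmark
    if z <= 1:
        return (mu_s - mu_b) * r_squared * (z - 1.0)   # smooth transition
    return (mu_s - mu_b) * math.log(z) * r_squared     # standard GT-Score
```

The linear (z − 1) branch matches the first-order expansion of ln(z) at z = 1, so the score is continuous across the gate.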
3.4. Minimum Sample Size (N ≥ 50) and Optional Period Stabilization
In Equation (2), N denotes the number of return observations used in the Z-score (and in estimating μ_s and σ). In this study we compute z using trade-level returns, so N equals the number of executed trades in the backtest window.
Because small samples produce unstable estimates of μ_s, σ, and the standard error term σ/√N, we impose a minimum-trade threshold of N ≥ 50: parameterizations generating fewer than 50 trades are assigned a large penalty during optimization. We recommend N ≥ 50 as a practical default because it provides a minimally stable sampling base for the Z-score, reduces sensitivity to a handful of outlier trades, and offers a consistent baseline that facilitates comparison across studies and users.
The dissertation (Sheppert, 2025) additionally used an adaptive “stabilized variance” option for periodization. In that variant, the equity curve is partitioned into n equal-length time periods, and n is chosen by searching for a value at which the variance of period returns changes by less than a small threshold across recent candidates (1% in the reference implementation), with a fallback to a fixed default when no plateau is detected. The motivation is to reduce sensitivity to an arbitrary choice of n by selecting a periodization where the variability of period returns is empirically stable for the given backtest window. The accompanying code supports both a fixed-threshold mode (with a default of N ≥ 50) and the stabilized option; results reported here use the fixed-threshold setting. Practitioners may adjust the minimum threshold or enable stabilization based on expected trade frequency and desired statistical power, but we suggest retaining the 50-trade minimum as a common reference point for comparability.
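The stabilized-periodization search can be sketched as follows; the candidate bounds (n from 4 to 20) and the fallback value are illustrative assumptions, and only the 1% tolerance comes from the reference implementation:

```python
import statistics

def stabilized_periods(returns, n_min=4, n_max=20, tol=0.01, fallback=10):
    """Pick a period count n at which the variance of period returns
    plateaus (relative change below tol between consecutive candidates).
    n_min, n_max, and the fallback are illustrative assumptions."""
    prev_var = None
    for n in range(n_min, n_max + 1):
        size = len(returns) // n
        if size == 0:
            break  # not enough observations to form n periods
        period_returns = [sum(returns[i * size:(i + 1) * size]) for i in range(n)]
        var = statistics.pvariance(period_returns)
        if prev_var is not None:
            denom = prev_var if prev_var > 0 else 1e-12
            if abs(var - prev_var) / denom < tol:
                return n  # variance plateau detected
        prev_var = var
    return fallback  # no plateau found within the candidate range
```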
4. Results
4.1. Monte Carlo Study Results
The Monte Carlo study comprised 9000 optimization trials across 50 assets, four loss functions, three strategies, and 15 random seeds.
Table 1 summarizes the results.
Key Finding: While GT-Score achieves slightly lower raw test returns (43.6% vs. 46–50%), it demonstrates a 56% higher generalization ratio (0.183 vs. ∼0.117), indicating substantially better retention of in-sample performance when evaluated out of sample.
This pattern reflects an expected trade-off. Objectives that directly maximize profit (Simple) or downside-risk-adjusted return (Sortino) can achieve modestly higher mean test returns, but they do so by selecting parameterizations with substantially higher in-sample returns, resulting in larger train–test performance decay. With 2250 paired trials per objective, small differences in mean returns can be statistically detectable even when effect sizes are economically small (Table 3). In practical model selection settings where many parameterizations are screened, improved generalization can reduce the risk that the selected strategy is an artifact of the search rather than a persistent signal. The primary advantage of the GT-Score is improved reliability rather than maximizing raw returns.
4.2. Walk-Forward Validation Results
The walk-forward validation comprised 5340 sequential optimization trials across nine time periods.
Table 2 summarizes the results.
Key Finding: The GT-Score achieves a 98% improvement in generalization ratio (0.365 vs. 0.185 average for baselines). This demonstrates that GT-Score strategies retain nearly twice as much of their training performance when applied to truly out-of-sample data.
4.3. Statistical Significance
Table 3 presents formal statistical comparisons between GT-Score and baseline methods.
These comparisons are used to assess whether objective functions differ on average under identical search budgets and evaluation protocols; they are not post-selection p-values for individual “discovered” strategies after exploring many variants. Because this study operates in a repeated-search setting, conventional inference can be overconfident without explicit data-snooping corrections and selection-aware reporting; we therefore emphasize effect sizes and the walk-forward generalization results as the primary evidence, and we treat the reported p-values as descriptive.
4.4. Performance Across Market Regimes
Table 4 reports validation returns by period to illustrate robustness across market regimes and to contextualize generalization improvements with raw out-of-sample performance.
4.5. Strategy-Level Analysis
Table 5 shows performance breakdown by trading strategy.
4.6. Visualization of Overfitting Reduction
To provide a clear visual comparison of overfitting reduction, we present two figures.
Figure 1 illustrates the difference in generalization capability across objectives. The generalization ratio (validation return divided by training return) measures how much of the training performance is retained when the strategy is applied to unseen data. GT-Score’s higher ratio indicates that it selects parameterizations that are less sensitive to the specific training window.
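The metric itself is simple; a sketch with illustrative numbers (not the study's per-trial data):

```python
def generalization_ratio(train_return, val_return):
    """Validation return divided by training return: the fraction of
    in-sample performance retained out of sample. Only meaningful
    when the training return is positive."""
    if train_return <= 0:
        raise ValueError("undefined for nonpositive training returns")
    return val_return / train_return

# e.g., a strategy earning 50% in training and 18% in validation
# retains a ratio of 0.36.
ratio = generalization_ratio(0.50, 0.18)
```

Guarding against nonpositive training returns matters in practice, since the ratio loses its "retention" interpretation when the numerator and denominator differ in sign.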
Figure 2 shows the generalization ratio computed separately for each time period. GT-Score demonstrates more consistent retention of training performance across different market regimes, including the volatile 2020–2022 period.
5. Discussion
The empirical results are consistent with the hypothesis that the GT-Score mitigates overfitting more effectively than conventional objective functions. GT-Score achieves a 98% higher generalization ratio in walk-forward validation (0.365 vs. 0.185), which is the central contribution of this work. Other methods achieve higher training returns but do not translate as well to out-of-sample performance.
This pattern reflects an expected return-versus-robustness trade-off. Profit- and Sortino-optimized strategies achieve slightly higher mean test returns but exhibit materially worse performance retention from training to out-of-sample data. GT-Score sacrifices a small amount of average return for substantially improved reliability, which is the central goal of an anti-overfitting objective.
Additionally, GT-Score delivers validation returns that are broadly comparable to baselines across market periods, suggesting genuine robustness rather than period-specific optimization.
5.1. Transaction Costs and Implementability
While the main study isolates the effect of the objective function under a daily-bar backtest (and therefore does not attempt instrument-specific execution modeling), implementability requires some cost awareness. We therefore report a lightweight transaction-cost sensitivity analysis on the Monte Carlo out-of-sample returns by subtracting an additional per-side cost (entry and exit) proportional to the number of test-window trades. This is particularly relevant because GT-Score-selected strategies exhibit higher average trade counts in our experiments (mean test-window trades 32.4 vs. ∼21.0 for baselines), so realistic trading frictions could disproportionately impact net performance.
Figure 3 shows the expected monotone decline in net returns as costs increase. Crucially, the relative ordering of the objective functions remains largely stable across the tested range (0–10 bps), suggesting that the robustness benefits of the GT-Score are not merely a function of high-turnover noise that disappears under moderate friction.
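The sensitivity check can be sketched as a simple additive cost adjustment; the gross return and trade count below are illustrative:

```python
def net_return(gross_return, n_trades, cost_bps):
    """Subtract a proportional per-side cost (paid on both entry and
    exit) from a gross test-window return; an additive approximation
    suitable for small costs."""
    per_side = cost_bps / 10_000.0
    return gross_return - 2.0 * per_side * n_trades

# Sweep 0-10 bps, mirroring the range examined in the sensitivity analysis.
curve = {bps: net_return(0.436, n_trades=32, cost_bps=bps) for bps in (0, 5, 10)}
```

Because the adjustment is linear in trade count, higher-turnover objectives decay faster as the per-side cost grows, which is exactly the effect the sweep is designed to expose.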
5.2. When Not to Use GT-Score
The GT-Score may be less appropriate when:
The number of trades is low (the z-score calculation requires a sufficient sample size);
Return distributions are highly skewed (the R² consistency term may over-penalize);
Real-time adaptation is required (the statistical components add computational overhead).
5.3. Limitations
This study has several limitations:
Experiments were confined to equity markets and technical trading strategies. The asset universe of 50 large-cap U.S. equities, while substantial, may not fully represent the diversity of tradable instruments; future work could extend to small-cap, international, and alternative asset classes.
The evaluation is restricted to daily-bar backtests (OHLCV) and does not include alternative data frequencies or paper/live trading evaluation. Accordingly, results should be interpreted as an objective-function comparison under controlled historical testing, not as a claim of deployability without further validation.
The z-score gate is parametric and relies on approximate Gaussian/i.i.d. assumptions for interpreting σ/√N as a standard error. Financial returns are known to exhibit fat tails (leptokurtosis), heteroskedasticity, and autocorrelation (Cont, 2001), which can undermine the reliability of this “significance” filter by overstating the effective sample size. In this paper we treat z as a heuristic optimization screen rather than an exact hypothesis test; more robust bootstrap or dependence-aware standard-error estimation could replace it in future work.
Although GT-Score is motivated by the multiple-testing/data-snooping problem in strategy searches, the method-comparison p-values reported here are conventional tests on realized out-of-sample returns and do not implement an explicit data-snooping correction or selection-bias adjustment integrated into inference. Readers should therefore interpret headline significance cautiously; formal multiple-testing control and selection-aware performance statistics are natural extensions.
Transaction costs were not explicitly modeled in the main tables;
Figure 3 provides a simple sensitivity check. Full execution-cost modeling (commissions, spreads, slippage, and liquidity constraints) is left for future work.
This study employed random search optimization for simplicity and reproducibility. A broader optimizer comparison is provided in the author’s dissertation (
Sheppert, 2025); gradient-based deep learning optimizers remain an open area for future research.
5.4. Future Work
Several extensions are natural. First, more robust inference could replace or augment the parametric Z-score gate with non-parametric/bootstrap approaches and explicit multiple-testing control to better handle dependence and large strategy searches (
Hansen, 2005;
Sullivan et al., 1999;
White, 2000). Second, more conservative performance reporting could incorporate selection-bias-adjusted statistics and explicit diagnostics for backtest overfitting (
Bailey et al., 2016;
Bailey & López de Prado, 2014). Third, because fat tails are central to risk measurement (
Cont, 2001;
Johnston, 2025), future work can integrate tail-risk-aware components into the objective or evaluation. Finally, broader testing across alternative data frequencies and (paper/live) evaluation would clarify practical deployability and regime sensitivity. Future research should also investigate the specific impact of alternative execution models (e.g., incorporating spread, slippage, and liquidity constraints) on the relative ranking of objective functions.
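As a sketch of such a dependence-aware replacement for the parametric gate, the following circular block bootstrap tests the mean excess return; the block length, replication count, and null-imposing recentering are generic choices, not the paper's specification:

```python
import random
import statistics

def block_bootstrap_pvalue(excess_returns, block_len=10, n_boot=500, seed=0):
    """One-sided p-value for mean excess return > 0 via a circular
    block bootstrap, which preserves short-range dependence that an
    i.i.d. Z-score ignores. Parameters are illustrative defaults."""
    rng = random.Random(seed)
    n = len(excess_returns)
    obs_mean = statistics.fmean(excess_returns)
    centered = [x - obs_mean for x in excess_returns]  # impose the null
    count = 0
    for _ in range(n_boot):
        sample = []
        while len(sample) < n:
            start = rng.randrange(n)  # circular block start
            sample.extend(centered[(start + i) % n] for i in range(block_len))
        if statistics.fmean(sample[:n]) >= obs_mean:
            count += 1
    return (count + 1) / (n_boot + 1)  # add-one correction
```

Replacing ln(z) with a monotone transform of such a bootstrap statistic would keep the gating behavior of Equation (1) while relaxing its Gaussian/i.i.d. assumptions.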
6. Conclusions
This paper presented an expanded empirical evaluation of the GT-Score across 50 stocks, nine time periods, and over 14,000 optimization trials. The key findings are
GT-Score improves the generalization ratio by 98% relative to conventional loss functions in walk-forward validation, indicating substantially reduced overfitting.
Paired tests on Monte Carlo out-of-sample returns indicate statistically detectable differences between objectives (
Table 3), with small effect sizes.
Out-of-sample performance is broadly comparable across market regimes and trading strategies.
For researchers and practitioners building quantitative backtesting pipelines, the GT-Score offers an alternative to traditional optimization objectives when the goal is robust model selection under repeated search. The evidence here supports improved out-of-sample generalization on daily equity data. While this study does not claim production-ready performance without further validation (e.g., paper/live trading), the results suggest that embedding robustness constraints into the optimization objective can materially reduce the “optimism bias” inherent in backtest-driven strategy development.
Reproducible code implementing the GT-Score, all experiments, and statistical analyses is provided as
Supplementary Material.