Human-AI Synergy in Statistical Arbitrage: Enhancing Robustness Across Volatile Financial Markets

Lei, Binxu

doi:10.3390/risks14030063


Research Institute of Economics and Management, Southwestern University of Finance and Economics, Chengdu 611130, China

Risks 2026, 14(3), 63; https://doi.org/10.3390/risks14030063

Submission received: 26 January 2026 / Revised: 5 March 2026 / Accepted: 10 March 2026 / Published: 12 March 2026

(This article belongs to the Special Issue AI-Driven Financial Econometrics and Risk Management)


Abstract

This study provides a structured review of statistical arbitrage research in the context of artificial intelligence, with a particular focus on machine learning based methods. The reviewed literature highlights the evolution from linear, rule-based strategies to increasingly complex data-driven models, while also documenting persistent challenges related to tail-risk exposure, regime instability, limited interpretability, and regulatory and governance constraints in practical applications. Building on this literature synthesis, the paper develops a conceptual AI-led, human-in-the-loop statistical arbitrage framework that integrates ML-generated signal modeling with structured human oversight—encompassing risk calibration, discretionary intervention, and interpretability review. This framework resonates with human-AI collaboration systems across other financial domains, collectively supporting the proposition that collaborative systems show potential to enhance resilience compared to purely AI-driven alternatives under specific market stress scenarios. It is positioned as a governance-oriented synthesis that qualitatively extends existing human-in-the-loop concepts by structurally embedding adaptive oversight within the statistical arbitrage decision architecture.

Keywords:

statistical arbitrage; pair trading; risk-adjusted performance; mean reversion; machine learning; CVaR; model interpretability

JEL Classification:

G17; G28; C45; C58

1. Introduction

According to the Efficient Market Hypothesis (EMH), asset prices fully incorporate available information, leaving little room for persistent abnormal returns (Fama 1970). Nonetheless, decades of empirical evidence on statistical arbitrage—especially pair trading based on cointegration and mean-reverting spreads—demonstrate that temporary pricing deviations can and do emerge in practice (Gatev et al. 2006). These deviations, although often short-lived, form the conceptual foundation of systematic arbitrage strategies designed to exploit relative mispricing across related assets.

As a key quantitative finance research direction, statistical arbitrage initially relied on econometric tools like cointegration and mean reversion. While theoretically sound and economically interpretable, these methods lack adaptability in high-dimensional, multifrequency, nonlinear markets. With the improvement of computing power and data availability, artificial intelligence (AI) techniques, particularly machine learning (ML) methods, have been widely introduced into statistical arbitrage research with the aim of depicting complex dependency structures, improving signal recognition ability, and enhancing the dynamic adjustment ability of strategies. Related research has been extended to multiple asset categories and multi-market situations.

A systematic review of the relevant literature shows that the introduction of ML methods (a key subset of AI) does not simply resolve the core problems of statistical arbitrage; to a certain extent, it amplifies long-standing challenges that were previously less prominent. Specifically, numerous studies based on ML models, especially deep learning (DL) and reinforcement learning (RL) approaches, perform well in backtesting but exhibit significant performance fluctuations out of sample, amid market structure changes, or during extreme market conditions. Meanwhile, increased model complexity often reduces strategy interpretability, making it difficult to clearly trace signal sources, economic logic, and decision-making paths. In addition, existing research usually treats the trading system as a fully automated decision-maker and pays insufficient attention to human judgment, manual intervention, and organizational governance mechanisms, which often play an irreplaceable role in practical applications.

Although many comprehensive studies have summarized statistical arbitrage methods and their technological evolution in recent years, most reviews still organize the material by algorithm type or modeling technique and do not systematically distill and integrate the problems that recur across studies. In particular, the role of human experts in model construction, risk control, and the handling of abnormal situations is often mentioned only in passing and is not included in a unified analytical framework. As a result, most existing reviews cannot fully answer a key question: as statistical arbitrage systems grow more complex, how should tasks be allocated and coordinated between ML models and human judgment?

This paper conducts a structured literature review following PRISMA guidelines, covering 52 core studies retrieved mainly from Web of Science and other databases (1979–2025). We first clarify the development trajectory and core content of statistical arbitrage. The synthesis then suggests that statistical arbitrage remains relevant across multiple markets and is an active, evolving research area, particularly as market structures and modeling techniques continue to change. This observation motivates recent conceptual discussions on how human judgment and ML may be jointly incorporated into statistical arbitrage systems.

Drawing on the key findings from the above literature review, this paper advances a conceptual human–AI collaborative framework for statistical arbitrage, distinct from the synthesized literature itself, in which ML functions as the primary signal-generation component under structured human oversight. In the literature synthesized by this study, algorithmic models are typically positioned as the primary engines for signal generation and execution, while human intervention is discussed as a governance mechanism for risk threshold calibration, abnormal-signal review, and strategy-logic validation. It should be emphasized that the framework proposed in this study is not the empirical realization of a specific trading system but a systematic integration of the human-AI collaborative ideas scattered across existing research, intended to provide a unified perspective for analysis and discussion.

The contributions of this study are primarily conceptual and integrative in nature and can be summarized along four dimensions.

First, rather than adopting a conventional technique-based taxonomy, this paper reorganizes the statistical arbitrage literature from a structural and problem-oriented perspective. By synthesizing recurring weaknesses—such as regime instability, tail-risk exposure, overfitting bias, and interpretability constraints—across both classical econometric and ML-driven paradigms, it shifts the analytical focus from incremental model improvement to systemic vulnerabilities embedded in automated trading architectures.

Second, the study integrates methodological rigor, robustness diagnostics, and governance considerations into a unified evaluative framework. Through a structured quality-scoring system and cross-study comparison, it moves beyond descriptive surveying and provides a transparent basis for assessing the empirical credibility of existing strategies.

Third, this paper explicitly formalizes the role of human oversight within statistical arbitrage systems. Instead of treating discretionary intervention as an ex post safeguard, it conceptualizes human judgment as an adaptive governance layer that interacts dynamically with algorithmic signal generation, particularly under structural breaks and extreme market conditions.

Fourth, building deductively on the documented limitations of purely automated systems, the study proposes a human–AI collaborative framework that emphasizes resilience and operational robustness rather than purely return maximization. The framework consolidates dispersed discussions of interpretability, risk calibration, and regulatory alignment into a coherent analytical architecture, thereby offering a governance-oriented reference model for future empirical implementation.

2. Theoretical Foundations and Core Issues

Statistical arbitrage was traditionally defined as an arbitrage strategy based on quantitative methods and statistical models. Its core logic lies in exploiting the mean-reversion characteristics of price series to capture temporary mispricing (Gatev et al. 2006). This definition rests on the premise of limited failures of the weak-form efficient market hypothesis: markets may deviate from fundamental-based pricing in the short term but tend to revert to the mean over the long term. Its typical methodology follows the “mean reversion-cointegration-pair trading” paradigm: cointegration tests identify assets with long-term equilibrium relationships, and pair trades are executed when their spread deviates from historical levels, going long on undervalued assets and short on overvalued ones to capture profits from mean reversion (Elliott et al. 2005; Engle and Granger 1987). Formally, the simplest two-asset pair trade can be expressed as constructing a spread,

s_{t} = p_{1, t} - β p_{2, t}

(or logarithmic spread), assuming this spread follows a stationary mean-reverting process. In the continuous limit, it is often modeled as an Ornstein–Uhlenbeck process:

d s_{t} = - κ s_{t} d t + σ d w_{t}

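$κ$ : The mean reversion speed coefficient ( $κ$ > 0); a larger value indicates a faster rate at which the spread reverts to its long-term equilibrium level (zero in this specification);
$σ$ : The volatility parameter, measuring the degree of uncertainty in spread changes;
$w_{t}$ : A standard Brownian motion (Wiener process) driving the random shocks to the spread.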

The speed of mean reversion κ > 0 ensures the recoverability of deviations, thereby generating “realizable arbitrage signals.” This modeling approach forms the theoretical foundation of classical statistical arbitrage, focusing on mean reversion and adopted by a series of modeling studies. Traditional statistical arbitrage models primarily rely on linear econometric methods and stationarity tests, emphasizing the dynamic process of market imbalance followed by reversion to the mean. This provided the foundational framework for early quantitative hedge fund strategies.
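As a rough illustration of the Ornstein–Uhlenbeck dynamics above, the following sketch simulates a zero-mean spread with the exact discretization of the process and recovers κ from a no-intercept AR(1) regression. The parameter values (κ = 5.0, σ = 0.2, daily steps of 1/252 years), the function names, and the random seed are illustrative assumptions, not values taken from the reviewed studies.

```python
import math
import random


def simulate_ou(kappa, sigma, dt, n_steps, s0=0.0, seed=0):
    """Simulate ds = -kappa * s dt + sigma dW using the exact discretization."""
    rng = random.Random(seed)
    phi = math.exp(-kappa * dt)  # one-step AR(1) coefficient implied by kappa
    step_std = sigma * math.sqrt((1.0 - phi ** 2) / (2.0 * kappa))
    path = [s0]
    for _ in range(n_steps):
        path.append(phi * path[-1] + step_std * rng.gauss(0.0, 1.0))
    return path


def estimate_kappa(path, dt):
    """Estimate kappa from the regression s[t+1] = phi * s[t] + noise."""
    num = sum(a * b for a, b in zip(path[:-1], path[1:]))
    den = sum(a * a for a in path[:-1])
    phi_hat = num / den
    return -math.log(phi_hat) / dt


# Illustrative (assumed) parameters: time is measured in years.
true_kappa, sigma, dt = 5.0, 0.2, 1.0 / 252.0
spread = simulate_ou(true_kappa, sigma, dt, n_steps=50_000, seed=42)
kappa_hat = estimate_kappa(spread, dt)
half_life = math.log(2.0) / kappa_hat  # time for a deviation to halve, in years
```

The half-life ln(2)/κ is a convenient way to read the estimate: a larger κ implies a shorter expected time for a spread deviation to decay by half. Estimates of κ from short samples are noisy, so in practice the estimation window matters as much as the estimator.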

With the continuous development of modern markets, market structures have become progressively more complex, and investor behavior has grown more diverse. Consequently, research and practice in statistical arbitrage have gradually moved beyond the classical paradigm of “mean reversion-cointegration-pair trading,” giving rise to several modern extensions. The original statistical arbitrage framework assumed highly rational investors and paid little attention to investor psychology. Contemporary scholars, however, have introduced behavioral finance perspectives into traditional theories, offering insights into the “human blind spots” overlooked by conventional finance. These researchers argue that investors’ cognitive biases and bounded rationality may lead to the alternating presence of momentum and reversal effects (Jegadeesh and Titman 1993). Within this framework, statistical arbitrage no longer relies solely on mean reversion assumptions but simultaneously accounts for trend persistence, forming a dual mechanism of “mean reversion-momentum.” In addition, modern researchers emphasize the significant impact of market frictions (transaction costs, liquidity constraints, short-selling restrictions, etc.) on the feasibility of statistical arbitrage (Avellaneda and Lee 2010; Jarrow et al. 2012). Unlike the frictionless markets assumed in earlier models, modern statistical arbitrage models often require dynamic optimization and risk control methods that account for execution delays, impact costs, and leverage constraints, so that the strategy remains effective in practice. Finally, with the rapid growth of computational power, statistical arbitrage has progressively integrated high-frequency trading and ML techniques into its practical strategies.

2.1. The Economic Significance of Statistical Arbitrage

Beyond generating profits in financial markets, statistical arbitrage also has a profound influence on the study of economics. According to Fama’s classic definition (Fama 1970), the EMH posits that financial markets accurately and unbiasedly reflect all available information, which means systematic excess returns are unattainable. However, empirical evidence from statistical arbitrage demonstrates that quantitative methods can consistently identify and exploit mispricing across assets, yielding substantial risk-adjusted returns (Balladares et al. 2021; Hogan et al. 2004). This implies that market price movements are not entirely random but exhibit short-to-medium-term deviations and predictability, sparking ongoing debates about the validity of weak-form and semi-strong-form market efficiency. Scholars hold competing perspectives: some studies interpret statistical arbitrage profits as temporary inefficiencies and thus as evidence of EMH violations (Gatev et al. 2006), while others argue that these returns compensate for unaccounted risks such as model risk or liquidity constraints (Hogan et al. 2004). In short, statistical arbitrage uses quantitative models to expose systemic market inefficiencies rather than isolated anomalies. This not only fuels discussions about the limits of market efficiency but also provides rich empirical evidence for the subsequent rise of behavioral finance and market friction theories. From an economic perspective, the significance of statistical arbitrage lies in revealing the actual existence of information asymmetry and investor bounded rationality in financial markets, thereby driving a re-examination and revision of market efficiency theory.

2.2. Key Issues and Controversies

In the preceding discussion, statistical arbitrage has been defined as a systematic trading approach based on the mean reversion hypothesis, capturing pricing deviations with pair trading as its typical strategy. At the same time, its scope has expanded beyond the traditional “spread reversion framework”: modern extensions incorporate behavioral finance and market frictions, emphasizing the roles of investor bounded rationality, liquidity friction, and risk compensation. However, regardless of the perspective taken, statistical arbitrage faces controversies both in academic research and in practical application.

First is the persistence of mean reversion. As the core of traditional statistical arbitrage theory, mean reversion does not consistently hold across all market conditions and time horizons in practice. In the short term, delayed price reactions and overshooting can create arbitrage opportunities (Jegadeesh and Titman 1993). Over the long term, however, long-term memory and structural risk may distort or delay the mean reversion process: prices may remain elevated for extended periods or fail to converge toward the mean, rendering the strategy ineffective (Hong and Stein 2007; Ramos-Requena et al. 2020; Serrano and Hoesli 2012). This suggests that mean reversion may be a localized, temporary phenomenon rather than a permanent property of the spread, so whether statistical arbitrage strategies built on it can be sustained in the long run remains an open question.

Second, in terms of return effectiveness, as market efficiency advances and investors grow more sophisticated, arbitrage opportunities have diminished over time. Traditional statistical arbitrage strategies, such as simple pair trading, have shown clear signs of weakness in many practical applications, with their returns declining gradually. In their systematic tests of pair trading, Do and Faff (2010) found that the performance of many strategies is highly dependent on the sample period and prevailing market conditions, which makes it difficult to achieve robust returns across different time periods. Stephenson et al. (2021) further argued that simple pair trades without further optimization struggle to sustain stable and significant returns in the current market environment. Sustaining statistical arbitrage therefore requires further methodological innovation.

To establish a rigorous theoretical position, this study defines statistical arbitrage robustness as the strategy’s ability to sustain positive risk-adjusted returns across regime shifts and structural breaks, rather than merely fitting historical data. Theoretically, this construct extends the weak-form EMH (Fama 1970) by arguing that exploitable mispricing (Gatev et al. 2006) is conditional on stable market structures. Thus, robustness is not a technical afterthought but a necessary theoretical condition for the existence of viable statistical arbitrage in dynamic financial markets.

3. Methodology

This study employs a multi-stage systematic review and conceptual synthesis protocol aligned with the PRISMA 2020 Statement for systematic reviews. The methodology is designed to move beyond descriptive summary, employing a critical lens to evaluate the evolution of statistical arbitrage from pure econometrics to AI integration, with rigorous definitions of search strategies, eligibility criteria, reliability assessment, and exclusion principles to ensure the scientific rigor and reproducibility of the review process.

3.1. Search Strategy, Eligibility Criteria, and Exclusion Principles

3.1.1. Search Strategy

A systematic bibliographic search was conducted across core indexing services (Web of Science Core Collection, Scopus, and ACM Digital Library), supplemented by SSRN for recent advancements in quantitative finance. The search period was set as 1979–2025 to cover the origin and latest development of statistical arbitrage research.

Boolean search strings for the convergence of three research dimensions were designed as follows:

Dimension 1 (Arbitrage theory): (“statistical arbitrage” OR “pair trading” OR “mean reversion” OR “cointegration”)
Dimension 2 (Computational modeling): (“machine learning” OR “deep learning” OR “reinforcement learning” OR “LSTM” OR “Attention mechanism” OR “AI” OR “artificial intelligence”)
Dimension 3 (Risk governance and application): (“risk management” OR “CVaR” OR “VaR” OR “model interpretability” OR “human-AI synergy” OR “statistical arbitrage application”)

The final combined search string: ((“statistical arbitrage” OR “pair trading” OR “mean reversion” OR “cointegration”) AND (“machine learning” OR “deep learning” OR “reinforcement learning” OR “AI”) AND (“risk management” OR “CVaR” OR “model interpretability” OR “human-AI synergy”)).
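For reproducibility, the combined string above can be assembled programmatically from the three term groups. The sketch below is only an illustration: the helper name `or_group` is hypothetical, straight quotes are used in place of typographic ones, and database-specific field tags or syntax variations are deliberately omitted.

```python
def or_group(terms):
    """Join quoted terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


# Term groups as they appear in the final combined search string.
arbitrage_terms = ["statistical arbitrage", "pair trading", "mean reversion", "cointegration"]
modeling_terms = ["machine learning", "deep learning", "reinforcement learning", "AI"]
governance_terms = ["risk management", "CVaR", "model interpretability", "human-AI synergy"]

groups = (arbitrage_terms, modeling_terms, governance_terms)
query = "(" + " AND ".join(or_group(group) for group in groups) + ")"
```

Keeping the term groups as separate lists makes it straightforward to record exactly which terms were used in each dimension when the search is rerun.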

Additional manual retrieval was performed for the reference lists of core included studies to avoid missing high-relevance literature.

3.1.2. Eligibility Criteria

Inclusion criteria for the literature were strictly defined as:

Thematic relevance: Focus on the theoretical construction, model improvement, or practical application of statistical arbitrage, and involve the integration of AI/ML methods or human-AI collaborative governance in statistical arbitrage;
Methodological rigor: Contain quantitative empirical analysis, backtesting verification, or model construction, with a clear description of the research design and data sources;
Literature type: Peer-reviewed journal articles, working papers with complete research content (SSRN); exclude conference abstracts with incomplete content;
Language: Only English literature was included to ensure the accuracy of content analysis.

3.1.3. Exclusion Principles

Literature was excluded if it met any of the following criteria:

Off-topic research: Focus on high-frequency market making, pure algorithmic trading, or traditional quantitative investment without statistical arbitrage core components;
Non-academic content: Market commentaries, industry white papers, technical reports without rigorous academic analysis and empirical verification;
Irrelevant technical research: Pure machine learning/AI algorithm research without financial statistical arbitrage application scenarios;
Incomplete research: Literature with missing key research content (e.g., no model design, no empirical results, unclear data sources);
Duplicate publications: The same research content is published in different journals/conferences, with only the most complete version included.

Each identified study was mapped into a Taxonomy of Algorithmic Complexity, allowing for a structured comparison between high-interpretability/low-adaptability models (e.g., classical VECM) and low-interpretability/high-adaptability models (e.g., black-box ML models).

3.2. Study Quality, Reliability Assessment, and Methodological Scoring System

Only high and medium-reliability studies are included in subsequent analysis. Each selected paper is systematically evaluated along multiple technical dimensions that are critical for the robustness and reproducibility of statistical arbitrage strategies.

Specifically, five core dimensions are considered: (i) data adequacy, (ii) model specification and validation, (iii) robustness and stability analysis, (iv) risk management and performance evaluation, and (v) interpretability and economic rationale. These dimensions reflect both classical econometric requirements and emerging challenges in ML–based trading systems.

The scoring criteria are closely linked to the reliability of research design; the detailed criteria, including the quantitative and qualitative standards for each score level, are summarized in Table 1. For each dimension, a discrete scoring scale ranging from 0 to 2 is employed, where 0 indicates insufficient or missing methodological treatment, 1 represents partial or basic implementation, and 2 denotes comprehensive and rigorous execution.

Data Adequacy (D): This dimension evaluates whether the study employs sufficiently long and representative samples, including multiple market regimes, and whether data preprocessing procedures such as normalization, filtering, and outlier treatment are properly documented. Studies using limited samples or lacking transparency in data construction receive lower scores.

Model Specification and Validation (M): This criterion assesses the clarity of model formulation and the rigor of validation procedures. Particular attention is paid to the use of out-of-sample testing, walk-forward analysis, cross-validation, and parameter stability checks. Studies relying solely on in-sample fitting are penalized.

Robustness and Stability Analysis (R): This dimension examines whether the proposed strategies are tested under alternative parameter settings, market conditions, and structural breaks. Sensitivity analysis, regime-switching evaluation, and stress-testing procedures are regarded as indicators of methodological robustness.

Risk Management and Performance Evaluation (P): This criterion evaluates the comprehensiveness of risk control mechanisms and performance metrics. Studies are assessed based on their incorporation of volatility modeling, drawdown analysis, tail-risk measures, and realistic transaction cost assumptions.

Interpretability and Economic Rationale (I): This dimension measures the extent to which the trading signals and model outputs are economically interpretable and theoretically justified. Greater weight is assigned to studies that provide clear links between statistical signals, market microstructure, and economic mechanisms.

For each study j, an overall methodological quality score $Q_{j}$ is constructed as a weighted aggregate of the five dimensions:

Q_{j} = \sum_{k \in \{D, M, R, P, I\}} w_{k} \cdot S_{j, k},

where $S_{j, k}$ denotes the score of study j on dimension k, and $w_{k}$ represents the corresponding weighting coefficient. In the baseline analysis, equal weights are assigned ( $w_{k}$ = 0.2) to avoid subjective bias, while alternative weighting schemes are examined in robustness checks.
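The weighted aggregate defined above can be sketched in a few lines. The example study scores and the alternative weighting scheme are hypothetical and serve only to illustrate the calculation; the requirement that weights sum to one is an assumption consistent with the equal-weight baseline (5 × 0.2), not a rule stated in the scoring system.

```python
DIMENSIONS = ("D", "M", "R", "P", "I")


def quality_score(scores, weights=None):
    """Weighted aggregate Q_j = sum_k w_k * S_{j,k}, with each S_{j,k} in {0, 1, 2}."""
    if weights is None:
        # Baseline scheme: equal weights of 0.2 per dimension.
        weights = {k: 1.0 / len(DIMENSIONS) for k in DIMENSIONS}
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    for k in DIMENSIONS:
        if scores[k] not in (0, 1, 2):
            raise ValueError(f"score for dimension {k} must be 0, 1, or 2")
    return sum(weights[k] * scores[k] for k in DIMENSIONS)


# Hypothetical scores for a single study, for illustration only.
example_scores = {"D": 2, "M": 1, "R": 1, "P": 2, "I": 0}
q_equal = quality_score(example_scores)

# Hypothetical alternative weighting used as a robustness check.
alt_weights = {"D": 0.1, "M": 0.3, "R": 0.3, "P": 0.1, "I": 0.2}
q_alt = quality_score(example_scores, alt_weights)
```

With equal weights the hypothetical study scores 1.2 out of a maximum of 2; under the alternative weights, which emphasize validation and robustness, the same study scores 1.0. Comparing the two shows how sensitive a study's ranking is to the weighting scheme.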

All studies are independently evaluated following this framework. In cases of ambiguous classification, assessments are cross-checked to ensure consistency. The resulting quality scores are subsequently used to guide the comparative analysis and framework construction in Section 4 and Section 5.

3.3. Methodological Rigor and Robustness Assessment

The synthesis does not merely report results but critically evaluates the “methodological depth” of the included studies. Each work underwent a technical audit focusing on:

Stationarity and Cointegration Stability: Evaluation of how models manage the degradation of cointegrating relationships during market turbulence.

Overfitting and Data Snooping Diagnostics: Scrutinizing the use of walk-forward validation and penalized loss functions (e.g., L1/L2 regularization) to ensure statistical validity.

Regime-Switching Capability: Assessing whether models incorporate exogenous features or structural break tests (e.g., Bai-Perron test) to adapt to non-linear market shifts. By identifying where purely automated models fail to account for these nuances, this section establishes the empirical gap that necessitates human intervention.
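The walk-forward validation referred to in the audit criteria above can be illustrated with a minimal index-splitting sketch. The window sizes and the function name are illustrative assumptions; the reviewed studies differ in whether they use rolling or expanding windows, and the rolling variant shown here is only one option.

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train, test) index ranges for a rolling walk-forward scheme."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # roll the whole window forward by one test block


# Illustrative sizes: 1000 observations, 500 for fitting, 100 for evaluation.
splits = list(walk_forward_splits(n_obs=1000, train_size=500, test_size=100))
```

Each test block lies strictly after its training block, so no future information enters model fitting. Because every block is evaluated once, performance can be compared across sub-periods, which is where regime dependence tends to become visible.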

3.4. Synthesis and Framework Derivation

The transition from review to the Human-AI Synergy Framework is governed by a Deductive Synthesis approach. Rather than a heuristic suggestion, the framework is conceptualized as a “Governance Layer” derived from the failure modes documented in the literature. The methodology follows a three-step logic: (1) Identifying the Technical Failure Points of ML models in statistical arbitrage; (2) Determining the Human Cognitive Strengths (e.g., context-aware risk judgment) that mitigate these points; and (3) Designing a Synergistic Loop that formalizes these interactions. This ensures the proposal is an intellectually solid extension of the literature review, grounded in the documented needs for adaptive oversight.

3.5. The PRISMA Flow

The PRISMA flow diagram used in this study is presented below. The flow follows the PRISMA 2020 Statement, and the number of records excluded at each stage is reported in accordance with the aforementioned inclusion and exclusion criteria to ensure the transparency, reproducibility, and reliability of the literature screening process.

Identification

Records identified from: Web of Science (n = 125); Scopus (n = 142); IEEE Xplore (n = 78); SSRN (n = 110)
Records removed before screening: duplicate records removed (n = 135)

Screening

Records screened by title and abstract (n = 320)
Records excluded (n = 220), for the following reasons:
Focus on high-frequency market making without statistical arbitrage components (n = 115)
Non-academic market commentaries or white papers (n = 65)
General machine learning papers without financial application (n = 40)

Eligibility

Reports sought for retrieval (n = 100)
Reports assessed for eligibility (n = 100)
Reports excluded (n = 48), for the following reasons:
Inadequate quantitative backtesting or empirical validation (n = 21)
Insufficient disclosure of model architectures or hyperparameters (n = 14)
Negligence of operational risk and human governance factors (n = 13)

Included

Total studies included in the systematic review (n = 52)
Application of the included studies:
Synthesis of Econometric and AI evolution (n = 28)
Identification of black-box failure modes (n = 12)
Conceptual derivation of the Synergy Framework (n = 12)
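The counts in the flow above can be reconciled with simple arithmetic. The short sketch below recomputes each stage from the reported figures; the variable names are illustrative and the numbers are copied from the flow.

```python
identified = {"Web of Science": 125, "Scopus": 142, "IEEE Xplore": 78, "SSRN": 110}
duplicates_removed = 135
screening_excluded = {"high-frequency market making": 115,
                      "non-academic content": 65,
                      "general ML without financial application": 40}
eligibility_excluded = {"inadequate backtesting or validation": 21,
                        "insufficient model disclosure": 14,
                        "no operational risk or governance factors": 13}
application = {"econometric and AI evolution": 28,
               "black-box failure modes": 12,
               "synergy framework derivation": 12}

# Each stage is the previous stage minus the records removed at that stage.
screened = sum(identified.values()) - duplicates_removed
assessed = screened - sum(screening_excluded.values())
included = assessed - sum(eligibility_excluded.values())
```

The recomputed values (320 screened, 100 assessed, 52 included) match the reported figures, and the three application categories also sum to 52.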

4. Literature Synthesis: From Traditional Econometrics to AI Integration

The previous sections discussed the theoretical basis and core controversies of statistical arbitrage. These foundations have changed little over time. By contrast, the methods of statistical arbitrage have developed iteratively and rapidly, in close connection with the transformation of financial market structure, breakthroughs in econometric techniques, and progress in computing power. From a comprehensive academic and practical perspective, this evolution can be divided into two main stages: the era of pure classical econometric models (1980s–2000s) and the era integrating AI (primarily machine learning) with econometrics (2010s–now). The era of classical econometric models is based on linear hypotheses and structured modeling frameworks, while the extension with ML broke the previous technical boundaries through large-scale, multi-dimensional, data-driven nonlinear learning. As shown in Figure 1, these two stages have jointly promoted the transition of statistical arbitrage from simple computation to a multi-dimensional intelligent decision-making model.

4.1. Statistical Arbitrage in the Era of Classical Econometric Models

The core logic of statistical arbitrage in the era of classical econometric models can be summarized as: “Identify stable statistical relationships between assets, model the mean-reversion process of spreads, and design arbitrage signal triggering mechanisms.” Its technical framework centers on these three elements. The literature on classical statistical arbitrage commonly describes a sequence of six modeling steps: Data Preprocessing, Statistical Relationship Identification, Factor and Dimension Optimization, Dynamic Forecasting and Signal Capture, Volatility and Risk Control, and Deepening Multivariate Dependencies. Through ongoing refinement with econometric and mathematical methods, the strategy is expected to satisfy three core criteria: statistical significance, out-of-sample robustness, and adaptability to market friction. In the sections that follow, these six steps are examined in sequence, following their technical logic, to explain how robust technical methods improve the rationality and effectiveness of quantitative strategies, making them easier to implement and supporting the achievement of meaningful returns.

Classic statistical arbitrage models emerged from the “quantitative revolution” of the 1970s and breakthroughs in econometrics. Academic groundwork was laid in U.S. university econometrics laboratories (e.g., MIT, UC San Diego), with practical application in Wall Street quantitative hedge funds. The U.S. stock market had already developed relatively sophisticated trading mechanisms, yet investor sophistication remained relatively low (primarily relying on fundamental analysis). Coupled with the rapid advancement of computing power, these factors provided crucial conditions for the successful implementation of statistical arbitrage. The econometric foundations of classic statistical arbitrage models during this period were established primarily by leading econometricians, such as Engle and Granger (1987) and Johansen (1988), whose research laid the cornerstone for the traditional econometric era. The following discussion summarizes representative modeling components frequently described in the literature, rather than proposing a complete or implementable trading architecture. Due to space limitations, the specific mathematical formulas of the methodologies mentioned in this stage are provided in Appendix A.

4.1.1. Data Preprocessing

Data preprocessing is a foundational step in constructing statistical arbitrage strategies. Its priority is to address four inherent flaws in financial market data: non-stationarity, outlier interference, dimensional heterogeneity, and high-frequency noise. Left unresolved, these flaws propagate into every subsequent stage of strategy construction, so preprocessing must combine statistical rigor with applicability to financial markets. The core steps are cleaning, stationarity adjustment, normalization, and noise reduction. Outlier detection (e.g., the median absolute deviation, MAD) and missing value imputation (e.g., K-nearest neighbor interpolation) address key data flaws, while stationarity tests (e.g., the augmented Dickey-Fuller (ADF) test) guard against spurious regression (Rousseeuw and Croux 1993; Troyanskaya et al. 2001; Dickey and Fuller 1979).

The core logic of K-nearest neighbor imputation is to estimate a missing value from the K complete samples that are most similar to the sample containing it. This approach mitigates the problems caused by missing data.
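The two cleaning steps named above can be illustrated with a minimal sketch. It assumes a MAD-based modified z-score with a cutoff of 3.5 and plain Euclidean distance for the neighbor search; the function names, the cutoff, and the toy data are illustrative assumptions, not taken from the reviewed studies.

```python
import math
import statistics


def mad_outlier_flags(values, threshold=3.5):
    """Flag points whose MAD-based (modified) z-score exceeds `threshold`."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are normally distributed.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def knn_impute(rows, target_idx, col, k=2):
    """Estimate rows[target_idx][col] from the k nearest complete rows."""
    target = rows[target_idx]
    observed = [j for j, v in enumerate(target) if v is not None]
    complete = [r for r in rows if all(v is not None for v in r)]

    def distance(row):
        # Euclidean distance over the features the target row actually has.
        return math.sqrt(sum((row[j] - target[j]) ** 2 for j in observed))

    nearest = sorted(complete, key=distance)[:k]
    return sum(r[col] for r in nearest) / len(nearest)


# Hypothetical daily returns with one obvious data error (0.25).
returns = [0.001, -0.002, 0.0015, 0.0005, 0.25, -0.001, 0.002]
flags = mad_outlier_flags(returns)

# Hypothetical feature rows; the last row is missing its third feature.
rows = [
    [1.00, 2.00, 3.0],
    [1.10, 2.10, 3.2],
    [5.00, 6.00, 9.0],
    [1.05, 2.05, None],
]
filled = knn_impute(rows, target_idx=3, col=2, k=2)
```

In this toy example only the 0.25 observation is flagged, and the missing value is filled with the average of the two closest complete rows.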

Stationarity: Non-stationary sequences can cause spurious regression in traditional econometric models, undermining their reliability. One core method is the ADF test, which determines whether a time series contains a “unit root”; rejecting the unit-root null hypothesis supports treating the series as stationary (Dickey and Fuller 1979).
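The unit-root idea can be sketched with a plain Dickey-Fuller regression, that is, the ADF test without lag augmentation. The critical value of about -2.86 is the approximate asymptotic 5% value for the constant-only case and is used here as an assumption; a full implementation would select lags and use finite-sample critical values.

```python
import math
import random


def dickey_fuller_t(series):
    """t-statistic on rho in: diff(y)_t = alpha + rho * y_{t-1} + e_t."""
    y_lag = series[:-1]
    dy = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(dy)
    mean_x = sum(y_lag) / n
    mean_y = sum(dy) / n
    sxx = sum((x - mean_x) ** 2 for x in y_lag)
    sxy = sum((x - mean_x) * (d - mean_y) for x, d in zip(y_lag, dy))
    rho = sxy / sxx
    alpha = mean_y - rho * mean_x
    resid = [d - alpha - rho * x for x, d in zip(y_lag, dy)]
    s2 = sum(e * e for e in resid) / (n - 2)  # residual variance
    return rho / math.sqrt(s2 / sxx)


rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(500)]  # stationary by construction
walk = [0.0]
for _ in range(499):
    walk.append(walk[-1] + rng.gauss(0, 1))    # unit root by construction

CRITICAL_5PCT = -2.86  # assumed approximate asymptotic value (constant, no trend)
t_noise = dickey_fuller_t(noise)
t_walk = dickey_fuller_t(walk)
noise_is_stationary = t_noise < CRITICAL_5PCT
```

A strongly negative statistic, as for the white-noise series, rejects the unit root; the random walk typically produces a statistic much closer to zero.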

Standardization and Noise Reduction: In statistical arbitrage data preprocessing, standardization and noise reduction are closely interconnected and are therefore introduced together here. Standardization addresses weighting biases caused by inconsistent units of measurement, while noise reduction removes interference in high-frequency data. Together, they support the accuracy of subsequent modeling steps (e.g., factor screening, similarity calculations). The most typical standardization method is the Z-score, which subtracts the sample mean from each observation and divides by the sample standard deviation.
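As a small illustration of how Z-score standardization removes unit differences, the sketch below rescales a price series and a volume series to a common scale. The two series are hypothetical, and the population standard deviation is an assumed choice.

```python
import statistics


def zscore(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical features measured in very different units.
price_usd = [101.0, 103.0, 99.0, 105.0]
volume_shares = [1.2e6, 0.9e6, 1.5e6, 1.1e6]

z_price = zscore(price_usd)
z_volume = zscore(volume_shares)
```

After the transformation both features have mean zero and unit standard deviation, so neither dominates a distance or similarity calculation merely because of its units.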

For high-frequency data denoising, a commonly used method is the Kalman filter, which dynamically estimates the true price (De Moura et al. 2016). Its core logic is to update the estimate of the unobserved true price recursively through a “state equation” and an “observation equation.”
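A minimal sketch of this recursion is given below for the simplest case, a scalar local-level model in which the true price follows a random walk and each quote is the true price plus noise. The model choice and the variances `q` and `r` are assumptions for illustration and are not the specification used by De Moura et al. (2016).

```python
def kalman_local_level(observations, q=1e-4, r=1e-2):
    """Filter noisy prices with a scalar local-level Kalman filter.

    State equation:       x_t = x_{t-1} + w_t,  Var(w_t) = q
    Observation equation: z_t = x_t + v_t,      Var(v_t) = r
    """
    x, p = observations[0], 1.0  # initial state estimate and its variance
    estimates = []
    for z in observations:
        p = p + q                # predict: uncertainty grows by q
        k = p / (p + r)          # Kalman gain
        x = x + k * (z - x)      # update with the new observation
        p = (1 - k) * p          # posterior variance
        estimates.append(x)
    return estimates


# Hypothetical quotes bouncing +/- 0.5 around a true price of 100.
observations = [100 + (0.5 if i % 2 == 0 else -0.5) for i in range(50)]
estimates = kalman_local_level(observations)
```

The ratio of `q` to `r` controls how quickly the filter follows new quotes: a smaller ratio yields a smoother estimate that reacts more slowly.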

4.1.2. Statistical Relationship Identification

The core of statistical arbitrage lies in identifying quantifiable, mean-reverting statistical relationships between assets. Only assets exhibiting such relationships can support a strategy’s sustainable profitability. At the same time, these relationships must undergo rigorous validity testing to guard against overfitting and in-sample coincidences that could cause the strategy to collapse in live trading. In the era of classical models, the relationships captured are typically linear.

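Classic Linear Relationship (Cointegration): To test whether two assets exhibit a long-term cointegration relationship, the Engle-Granger two-step method (Engle and Granger 1987) is typically employed: one price series is first regressed on the other, and the regression residuals are then tested for stationarity.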
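The two steps can be sketched as follows on simulated data. The residual test here is a Dickey-Fuller regression without lags, and the 5% critical value of roughly -3.34 for two variables is an assumed approximation; the simulated series and all names are hypothetical.

```python
import math
import random


def ols(x, y):
    """Step 1: return (alpha, beta) for y = alpha + beta * x + u."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sxy / sxx
    return my - beta * mx, beta


def residual_df_t(u):
    """Step 2: t-statistic on rho in diff(u)_t = rho * u_{t-1} + e_t."""
    lag = u[:-1]
    du = [b - a for a, b in zip(u[:-1], u[1:])]
    sxx = sum(v * v for v in lag)
    rho = sum(v * d for v, d in zip(lag, du)) / sxx
    s2 = sum((d - rho * v) ** 2 for v, d in zip(lag, du)) / (len(du) - 1)
    return rho / math.sqrt(s2 / sxx)


rng = random.Random(1)
x = [100.0]
for _ in range(499):
    x.append(x[-1] + rng.gauss(0, 1))                # random-walk price
y = [5.0 + 2.0 * xi + rng.gauss(0, 0.5) for xi in x]  # cointegrated with x

alpha, beta = ols(x, y)
spread = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
t_stat = residual_df_t(spread)

CRITICAL_5PCT = -3.34  # assumed approximate value for two variables
cointegrated = t_stat < CRITICAL_5PCT
```

The estimated `beta` serves as the hedge ratio, and `spread` is the series whose mean reversion the strategy would trade.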

4.1.3. Factor and Dimension Optimization

After the foundational work is complete, the next step is to customize and optimize the model, which in statistical arbitrage means optimizing factors and dimensions. This phase focuses on two core directions: “screening effective factors” and “streamlining dimensions.” The aim is to extract strong, clean core signals from noisy data so that subsequent models operate with greater precision and deliver more reliable outcomes.

Factor Screening: Grinold and Kahn (2000) proposed using the Information Coefficient (IC), the correlation between factor forecasts and subsequently realized returns, to screen high-information factors (Vergara and Kristjanpoller 2024).
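One common way to compute such a screening statistic is a rank correlation between factor values and next-period returns, sketched below. Using ranks (a "rank IC") is an assumed convention, ties are not handled, and the data are hypothetical.

```python
import math


def ranks(values):
    """Rank from 1 (smallest) to n (largest); ties are not handled."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, idx in enumerate(order, start=1):
        out[idx] = position
    return out


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def rank_ic(factor, forward_returns):
    """Spearman-style correlation between a factor and later returns."""
    return pearson(ranks(factor), ranks(forward_returns))


# Hypothetical cross-section of five assets.
factor = [0.1, 0.4, 0.2, 0.9, 0.5]
forward_returns = [0.00, 0.02, 0.01, 0.03, 0.05]
ic = rank_ic(factor, forward_returns)
```

Factors whose IC is consistently different from zero across periods would be kept; those with an IC near zero would be screened out.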

Dimension Reduction: PCA (Caneo and Kristjanpoller 2021) is commonly used for linear relationships. It computes the covariance matrix of the data, solves for its eigenvalues and eigenvectors, and then retains the leading principal components that together explain over 80% of the variance. For non-linear relationships, t-SNE is frequently employed to reduce high-dimensional nonlinear data to a low-dimensional representation.
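The 80% retention rule can be sketched in a few lines once the eigenvalues of the covariance matrix are available. The eigenvalues below are hypothetical and would in practice come from a linear algebra routine; the function name is an assumption.

```python
def n_components_for_variance(eigenvalues, target=0.80):
    """Smallest number of leading components reaching the target share."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, lam in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += lam
        if cumulative / total >= target:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues of a 5-asset covariance matrix.
eigenvalues = [5.0, 2.0, 1.5, 1.0, 0.5]
k = n_components_for_variance(eigenvalues)
```

Here the first two components explain 70% of the variance and the first three explain 85%, so three components are retained.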

4.1.4. Dynamic Forecasting and Signal Capture

After completing the preliminary work, dynamic forecasting and signal capture constitute the subsequent analytical phase based on optimized core features. This represents the critical stage for strategy implementation, achieving the core transition from “static rules” to “intelligent decision-making.” The following sections will break down this crucial phase in detail.

In classical static signal components, cointegration spreads form the core foundation. The methodology generally involves first identifying financial assets with long-term correlations and then calculating the price spread between them. Based on the mean and volatility of the spread, a fixed threshold is set (typically one and a half standard deviations above or below the mean). When the spread breaches this threshold, an arbitrage signal is triggered (Elliott et al. 2005). This approach implements relatively stable, static rule-based trading on the principle that “spreads will eventually revert to the mean.”
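The threshold rule can be sketched as below. For brevity the mean and standard deviation are computed on the same sample that is traded, which introduces look-ahead; in practice they would be estimated on a separate formation window. The spread values and the sign convention are illustrative assumptions.

```python
import statistics


def spread_signals(spread, k=1.5):
    """Return -1 (short the spread), +1 (long the spread), or 0 per point."""
    mu = statistics.fmean(spread)
    sd = statistics.pstdev(spread)
    upper, lower = mu + k * sd, mu - k * sd
    signals = []
    for s in spread:
        if s > upper:
            signals.append(-1)  # spread unusually wide: bet on it narrowing
        elif s < lower:
            signals.append(1)   # spread unusually narrow: bet on it widening
        else:
            signals.append(0)   # inside the band: no trade
    return signals


# Hypothetical spread series with two large deviations.
spread = [0.0, 0.1, -0.1, 0.05, 1.0, -0.05, -1.0, 0.0]
signals = spread_signals(spread)
```

Only the two observations beyond 1.5 standard deviations trigger a signal; all others fall inside the band.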

4.1.5. Volatility and Risk Control

For a quantitative strategy to be valuable, its profitability must be sustainable. Precise assessment of market volatility and effective mitigation of extreme risks are central to achieving this sustained profitability. As a critical step for stable returns in quantitative strategies, this process provides robust assurance for strategy implementation through quantitative risk measurement and dynamic portfolio rebalancing. Regarding quantitative risk and risk management, this section will delve into two key aspects: risk measurement and dynamic control.

Risk Measurement: Volatility modeling is a commonly used approach to quantifying risk. The most typical methods characterize volatility through GARCH models and stochastic volatility (SV) models.
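As an illustration of the GARCH approach, the sketch below runs the GARCH(1,1) conditional variance recursion with fixed parameters. The parameter values and returns are hypothetical; in practice the parameters would be estimated by maximum likelihood.

```python
def garch11_variance(returns, omega, alpha, beta):
    """Conditional variance path: var_t = omega + alpha*r_{t-1}^2 + beta*var_{t-1}."""
    var = omega / (1 - alpha - beta)  # start at the unconditional variance
    path = [var]
    for r in returns[:-1]:
        var = omega + alpha * r * r + beta * var
        path.append(var)
    return path


# Hypothetical daily returns and parameters (alpha + beta < 1).
returns = [0.01, -0.03, 0.005]
path = garch11_variance(returns, omega=1e-6, alpha=0.10, beta=0.85)
```

The large move on the second day raises the variance forecast for the third day, which is the volatility clustering that the model is designed to capture.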

Several studies incorporate Conditional Value-at-Risk (CVaR) both as a performance metric and as an operational constraint in statistical arbitrage and portfolio optimization settings, aiming to reduce tail-risk exposure at execution (Rockafellar and Uryasev 2002; Chow et al. 2018).
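A simple empirical (historical) estimate of CVaR is sketched below: it averages the losses in the worst tail of the sample. The tail-size rounding rule and the return data are assumptions for illustration; the cited studies use optimization-based formulations.

```python
def var_cvar(returns, level=0.95):
    """Historical VaR and CVaR, both expressed as positive losses."""
    losses = sorted((-r for r in returns), reverse=True)  # worst first
    n_tail = max(1, int(round(len(losses) * (1 - level))))
    tail = losses[:n_tail]
    value_at_risk = tail[-1]         # smallest loss inside the tail
    cvar = sum(tail) / len(tail)     # average loss inside the tail
    return value_at_risk, cvar


# Hypothetical sample: 18 small gains and two large losses.
returns = [0.01] * 18 + [-0.05, -0.10]
value_at_risk, cvar = var_cvar(returns, level=0.90)
```

With 20 observations and a 90% level, the tail contains the two worst losses, so the VaR is 5% and the CVaR is their average, 7.5%.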

Several studies propose robustness checks for machine-learning-based statistical arbitrage implementations, including (a) nested cross-validation for hyperparameter tuning and unbiased performance evaluation; (b) rolling-window re-estimation (e.g., a three-year window with a monthly step) to assess stability across different market regimes; (c) block bootstrap procedures (for example, 20-day blocks with at least 1000 resamples) to construct confidence intervals under serial dependence; (d) Deflated Sharpe Ratio adjustments to correct for multiple testing and potential non-Gaussian return distributions; and (e) explainability and feature-sanity checks (such as SHAP or permutation importance), whereby signals are considered reliable only when their primary drivers have plausible economic interpretations.

Together, these procedures provide a structured approach for improving the reliability and practical credibility of ML-based trading strategies.
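Check (c) can be sketched as a moving-block bootstrap for the annualized Sharpe ratio, using the 20-day blocks and 1000 resamples mentioned above. The percentile interval, the 252-day annualization, and the simulated returns are assumptions for illustration.

```python
import math
import random
import statistics


def sharpe(returns, periods=252):
    """Annualized Sharpe ratio with a zero risk-free rate (assumed)."""
    return statistics.fmean(returns) / statistics.stdev(returns) * math.sqrt(periods)


def block_bootstrap_ci(returns, block=20, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile confidence interval from a moving-block bootstrap."""
    rng = random.Random(seed)
    n = len(returns)
    starts = range(n - block + 1)
    stats = []
    for _ in range(n_resamples):
        sample = []
        while len(sample) < n:
            s = rng.choice(starts)
            sample.extend(returns[s:s + block])  # keep within-block dependence
        stats.append(sharpe(sample[:n]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical daily strategy returns.
rng = random.Random(42)
daily_returns = [rng.gauss(0.0005, 0.01) for _ in range(500)]
ci_lower, ci_upper = block_bootstrap_ci(daily_returns)
```

Resampling whole blocks rather than single days preserves short-range serial dependence, which an ordinary bootstrap would destroy.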

4.1.6. Deepening Multivariate Dependencies

In the preceding steps, relationships among assets were identified and optimized. Several strands of the literature extend classical pairwise analysis by modeling nonlinear, dynamic, and networked dependencies among multiple assets, aiming to depict cross-asset interactions more realistically and thereby improve strategy accuracy. This process is analyzed in two parts: classical dynamic dependencies and modern network dependencies.

Classic dynamic dependencies are captured via the DCC-GARCH method (Engle 2002), which estimates real-time cross-asset correlations by first modeling individual asset volatility and then constructing a time-varying covariance matrix to reflect dynamic market relationships.

Two parameters control the weighting of new information versus historical data, enabling the model’s intermediate (pseudo-variance) matrix to adjust dynamically over time. This matrix is then standardized to obtain a dynamic correlation matrix, each element of which is the correlation coefficient between the corresponding two assets at the current time step.
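For reference, the recursion described above is commonly written in the standard DCC(1,1) form shown below. The symbols $a$, $b$, $\varepsilon_t$, $\bar{Q}$, $Q_t$, and $R_t$ are notation assumed here for illustration and may differ from the notation in Appendix A.

$$
\begin{aligned}
Q_t &= (1 - a - b)\,\bar{Q} + a\,\varepsilon_{t-1}\varepsilon_{t-1}^{\top} + b\,Q_{t-1},\\
R_t &= \operatorname{diag}(Q_t)^{-1/2}\, Q_t \,\operatorname{diag}(Q_t)^{-1/2}.
\end{aligned}
$$

Here $\varepsilon_t$ is the vector of returns standardized by their univariate GARCH volatilities, $\bar{Q}$ is the sample covariance matrix of $\varepsilon_t$, and $a, b \ge 0$ with $a + b < 1$ are the two weighting parameters. The second line rescales $Q_t$ so that $R_t$ has a unit diagonal and can be read as a correlation matrix.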

Modern Network Dependencies: When modeling the nonlinear, asymmetric, and network-based dependencies between assets that are influenced by other assets, we often employ the Pair-Copula method. Through Pair-Copula decomposition, we ultimately derive a joint distribution model for multiple assets—that is, a complete mathematical model capturing the collective fluctuations of all assets. Simultaneously, we can uncover specific dependency details—such as identifying core nodes and determining which assets influence the correlation between any two assets. This yields a clear, traceable network.

4.1.7. Summary of This Era

The era of classical econometric models provided statistical arbitrage with systematic tools for identifying arbitrage opportunities through rich and rigorous modeling techniques, laying the foundation for its long-term development and offering abundant insights for later work. However, these linear frameworks are limited in their ability to capture complex nonlinear relationships and theoretically lack resilience to nonlinear regime changes, which restricts their robustness outside stable market conditions. For this reason, classical econometrics and ML-based methods complement each other in modern statistical arbitrage.

4.2. The Extension in the Era of AI

Over the past two decades, AI breakthroughs have reshaped the operating logic of financial markets. Driven by exponential growth in computing power and advances in ML algorithms, statistical arbitrage has integrated ML and other AI techniques. Early statistical arbitrage relied on manually defined variables and linear frameworks, which is analogous to measuring a winding river with a straight ruler and makes nonlinear market fluctuations difficult to capture. AI techniques, by contrast, use neural networks and deep learning architectures to integrate multi-dimensional data and mine nonlinear relationships autonomously, allowing arbitrage strategies to move beyond traditional boundaries.

Current ML-driven statistical arbitrage systems offer three core advantages. First, when processing high-dimensional financial data, ML models exhibit pattern recognition capabilities that far surpass traditional linear methods. Through algorithms including neural networks, gradient boosting trees, and random forests, the system automatically captures nonlinear variable interactions and latent structures, which enables the identification of predictive arbitrage signals within massive asset datasets. Krauss et al. (2017) compared the performance of deep neural networks, gradient boosting trees, and random forests in statistical arbitrage using S&P 500 stocks as a sample. The results show that the deep learning model is significantly better than the traditional strategy in terms of return forecasting accuracy and economic performance. This finding suggests the potential effectiveness of ML in complex market signal recognition and provides a new technical path for signal construction in statistical arbitrage.

Second, the dynamics and timing of signals are particularly critical in statistical arbitrage. Traditional methods usually assume a linear regression or static mean-reversion relationship between returns and spreads, while deep learning models based on architectures such as LSTM and Transformer can better capture the nonlinear dynamics and lagged effects of time series. The key innovation of LSTM lies in its memory cell and three gates: the forget gate, the input gate, and the output gate. These gates work together to filter, update, and output historical information, allowing the network to capture nonlinear dependencies.
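For readers who want the gate mechanism in symbols, the standard LSTM cell is commonly written as below. The notation ($W$, $b$, $h_t$, $c_t$, $\odot$) is the usual textbook convention and is assumed here; it is not taken from the reviewed studies.

$$
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate memory)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)}\\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$

The forget gate $f_t$ filters the previous memory $c_{t-1}$, the input gate $i_t$ controls how much new information is written, and the output gate $o_t$ determines what is exposed as the hidden state $h_t$.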

Flori and Regoli (2021) found that embedding an LSTM model into the pairs trading framework significantly improves the profitability and stability of the strategy, especially when transaction costs and risk exposure are controlled. This study supports the advantages of deep sequence networks in predicting the dynamics of asset spreads and optimizing entry and exit timing, and it provides methodological support for constructing dynamic statistical arbitrage systems. It suggests that using ML techniques to capture complex nonlinear relationships can materially improve the profitability of statistical arbitrage models.

Third, as the scale and dimensionality of financial market data expand, the parallel computing capability of ML algorithms enables statistical arbitrage to extend from single asset pairs to multi-asset and even cross-market portfolio strategies. Huck (2019) shows the feasibility of applying ML models to large-scale data sets for arbitrage signal screening and portfolio optimization. The study pointed out that using ML algorithms for feature screening and nonlinear mapping not only improves the scalability of strategies but also significantly enhances the ability to identify arbitrage opportunities in high-dimensional space. This suggests that ML-driven statistical arbitrage systems can remain robust under large-sample conditions, laying the foundation for institutional investors to achieve systematic, large-scale deployment.

Although AI has made strategy generation more intelligent, its strong fitting ability also brings a serious risk of backtest overfitting and selection bias. ML models often perform large-scale parameter searches and feature optimization on historical data, and this kind of “data mining” is very likely to cause significant out-of-sample performance decay. The “Deflated Sharpe Ratio” proposed by Bailey and Lopez De Prado (2014) provides a framework for evaluating quantitative strategy performance that corrects the inflated returns caused by multiple testing and non-normal return distributions. They pointed out that without a statistical correction for overfitting, a model may be completely invalid in actual investment. This challenge is particularly critical for ML-driven statistical arbitrage because the risk of overfitting grows with model complexity. Future research therefore needs to introduce stricter cross-validation mechanisms, rolling-sample testing, and robustness testing in the model training stage to ensure that strategies remain transferable and stable in real markets. The following subsections discuss the four most commonly used ML methods and analyze how ML contributes to the development of statistical arbitrage.
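As a rough illustration of the Deflated Sharpe Ratio correction mentioned above, the sketch below follows the commonly cited form of the statistic. The function and argument names are assumptions, the Sharpe ratio is per period (not annualized), kurtosis is non-excess (3 for a normal distribution), and the input values are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def deflated_sharpe_ratio(sr, n_obs, skew, kurt, n_trials, sr_trials_var):
    """Probability that the true Sharpe ratio exceeds the best-of-N benchmark.

    sr            : observed per-period Sharpe ratio of the selected strategy
    n_obs         : number of return observations
    skew, kurt    : skewness and (non-excess) kurtosis of the returns
    n_trials      : number of independent strategy trials (must be > 1)
    sr_trials_var : variance of the Sharpe ratios across those trials
    """
    z = NormalDist()
    # Expected maximum Sharpe ratio under the null of zero skill.
    sr0 = math.sqrt(sr_trials_var) * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * math.e))
    )
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return z.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)


# Same backtest result, evaluated after 10 versus 100 trials.
dsr_10 = deflated_sharpe_ratio(0.10, 1250, 0.0, 3.0, 10, 0.0025)
dsr_100 = deflated_sharpe_ratio(0.10, 1250, 0.0, 3.0, 100, 0.0025)
```

The same observed Sharpe ratio earns a lower deflated value when it was selected from more trials, which is how the statistic penalizes extensive searching.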

4.2.1. Supervised Learning and Unsupervised Learning

Supervised learning employs labeled datasets to train models, enabling them to learn the mapping between inputs and outputs (Krauss et al. 2017). In the research and practice of statistical arbitrage, its core value lies in capturing complex nonlinear patterns and thereby improving the accuracy of arbitrage signals. Supervised learning models such as random forests and deep neural networks can compensate for the limitations of simple linear models and identify more complex price dynamics. By improving the identification of nonlinear price movements, these methods can raise signal accuracy and reduce noise interference, which makes the strategy more robust. Supervised learning methods can also maintain good generalization in a continuously updated data flow.

Unlike supervised learning, unsupervised learning requires no pre-labeled data and uncovers latent asset structures independently (Han et al. 2023). In statistical arbitrage, PCA is used for linear dimension reduction (retaining principal components explaining ≥ 80% of variance) and t-SNE for nonlinear dimension reduction, identifying hidden cross-asset correlations to inform multi-asset strategies. When constructing a multi-asset arbitrage strategy, unsupervised learning has been reported to be effective in uncovering hidden interdependencies across large asset universes, providing basic evidence of the correlations that support statistical arbitrage. It also reduces the manual effort required and, as a supplementary tool, can make strategy construction more accurate.

4.2.2. Reinforcement Learning

Reinforcement learning operates by constructing an “environment-agent-reward” interaction framework, which is intended to support agents in continuously optimizing decision strategies through ongoing interaction with dynamic environments (Jaimungal et al. 2022). In ML-based statistical arbitrage (Coache et al. 2023), the environment includes market volatility and liquidity, the agent decides position sizing and entry/exit timing, and the reward function integrates the Sharpe Ratio and CVaR. This design enables real-time position adjustments and adaptive profit-taking and stop-loss optimization. It allows strategies to adapt better to market fluctuations and could contribute to more stable investment performance.

In reinforcement learning-based trading frameworks, the portfolio decision process is commonly formulated as a Markov Decision Process (MDP), where the agent observes a market state $s_{t}$, selects an action $a_{t}$, and receives a reward $r_{t}$. The optimal action-value function is defined by the Bellman optimality equation.
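For reference, the Bellman optimality equation is usually stated in the form below. The discount factor $\gamma \in [0, 1)$ and the expectation over the next state are standard notation assumed here, since the equation itself is not reproduced in this section.

$$
Q^{*}(s_t, a_t) = \mathbb{E}\left[\, r_t + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, s_t, a_t \right]
$$

In a trading context, $Q^{*}(s_t, a_t)$ is the expected discounted reward of taking action $a_t$ (for example, a position change) in market state $s_t$ and acting optimally afterwards.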

4.2.3. Deep Learning

In ML-driven statistical arbitrage systems, deep learning models (especially LSTM) are widely used to predict spread changes and potential mean-reversion opportunities between assets. Unlike traditional linear regression, LSTM is designed to capture nonlinear dynamic relationships in the time dimension, and has shown this capability in relevant empirical contexts. Its core idea is to retain or forget historical information through the gate structure, thus forming a nonlinear mapping to future price trends. A simplified prediction function can be written as:

$$\hat{y}_{t+1} = f(x_{t}; \theta) = \sigma(w x_{t} + b)$$

Here, $x_{t}$ is the input feature (such as spread, yield, or turnover), $w$ and $b$ are the model parameters (collectively $\theta$), and $\sigma(\cdot)$ is a nonlinear activation function (commonly sigmoid or tanh). The function learns the mapping from the training sample, and the output $\hat{y}_{t+1}$ is used as a prediction of the direction or magnitude of the next spread. The training objective is usually to minimize the mean squared error (MSE).
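The simplified prediction function and the MSE objective can be sketched as follows. This mirrors only the single-layer simplification given in the text, not a full LSTM, and the weights, inputs, and targets are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Simplified predictor: sigma(w . x + b) for one feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def mse(predictions, targets):
    """Mean squared error between predictions and realized targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Hypothetical weights and a neutral input (all features at zero).
w, b = [0.3, -0.2], 0.0
neutral = predict([0.0, 0.0], w, b)

# Two uninformative predictions against realized up/down labels.
loss = mse([0.5, 0.5], [1.0, 0.0])
```

With a sigmoid output, a value above 0.5 could be read as a predicted upward move in the spread and a value below 0.5 as a predicted downward move; training would adjust `w` and `b` to reduce `loss`.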

Fischer and Krauss (2018) examined the effectiveness of this method in an empirical study published in the European Journal of Operational Research. They found that a statistical arbitrage strategy built on LSTM directional predictions for S&P 500 stocks achieves significantly higher risk-adjusted returns than traditional methods, which indicates that deep learning has considerable application value in capturing nonlinear market structure and temporal dependencies.

4.2.4. Summary of the Extension in This Era

The development of this era can be summarized as three shifts: from linear to nonlinear approaches, from rule-driven to data-driven methodologies, and from single-asset analysis to multi-market coordination. Looking ahead, the future trajectory of statistical arbitrage may focus on three key areas: leveraging multimodal data fusion to transcend the dimensional limits of traditional financial data, establishing cross-market coordination mechanisms, and, most critically, enhancing the interpretability of ML-driven statistical arbitrage to meet regulatory requirements. Advances in these domains could improve the profitability and regulatory compliance of future statistical arbitrage strategies, making them more competitive in financial markets. Despite improved predictive power, these models introduce a new theoretical tension: increased complexity often comes at the cost of parameter stability, undermining out-of-sample robustness (Bailey and Lopez De Prado 2014).

5. Literature Synthesis: Cross-Market Performance

Statistical arbitrage, as an important strategy within quantitative investing, centers on systematically capturing pricing errors to generate returns. However, financial markets vary widely in terms of their trading mechanisms, level of development, and investor sophistication. For instance, stock markets have a long history and relatively mature trading systems; cryptocurrency markets, being newer, exhibit extreme volatility; and derivatives markets feature complex products with specialized pricing logic. These differences necessitate distinct strategies when implementing statistical arbitrage across different markets. The following subsections analyze the performance of statistical arbitrage across its five primary application arenas based on real-world outcomes: the stock market, the foreign exchange market, the cryptocurrency market, the derivatives market, and the ETF and LETF markets.

5.1. Stock Market

As a core traditional financial market, the stock market offers a broad and typical setting for statistical arbitrage because of its mature trading mechanisms and ample liquidity, with pairs trading as the most classic and widely used strategy. Past research on S&P 500 constituents in the U.S. stock market reports that pairs trading strategies constructed from stock price correlations can achieve annualized returns of 9% to 12% in periods when the S&P 500 does not rise consistently (Gatev et al. 2006). These findings are often interpreted in the literature as evidence that mean-reversion-based pairs trading has exhibited economically meaningful performance under certain market conditions (Elliott et al. 2005; Gatev et al. 2006). Furthermore, from a market microstructure perspective, the liquidity premium, which is most prominent in stock markets, is a double-edged sword for statistical arbitrage. Menkveld’s (2013) research indicates that while high-frequency trading intensifies short-term volatility in stock markets, it also accelerates the pace of price deviation and reversion. This indirectly increases the frequency of arbitrage opportunities in stock markets, creating greater potential for statistical arbitrage. As the market where statistical arbitrage is most widely applied, the stock market offers high operability and substantial profit potential. With the continuous integration of ML methods into statistical arbitrage, the literature generally suggests that equity markets remain an important empirical setting for statistical arbitrage research.

5.2. FX Market

Regarding arbitrage in the foreign exchange market, the most typical form is triangular arbitrage, which exploits inconsistencies among the exchange rates of three currencies to earn a profit through a sequence of currency exchanges. A major problem is that arbitrage opportunities in the forex market tend to vanish quickly and are susceptible to policy interventions and transaction costs, making it relatively difficult to capture them consistently and achieve stable profits (Ciacci et al. 2020). Nevertheless, given ongoing advances in computing power, the foreign exchange market is likely to remain a key application area for statistical arbitrage.
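A worked example helps show why transaction costs matter so much for triangular arbitrage. The sketch below converts one unit of a base currency through two others and back; the three quotes and the proportional cost per leg are hypothetical.

```python
def triangular_round_trip(eur_usd, usd_jpy, eur_jpy, cost_per_leg=0.0):
    """Net return of EUR -> USD -> JPY -> EUR, starting from 1 EUR."""
    usd = 1.0 * eur_usd   # sell EUR for USD
    jpy = usd * usd_jpy   # sell USD for JPY
    eur = jpy / eur_jpy   # buy EUR back with JPY
    return eur * (1 - cost_per_leg) ** 3 - 1


# Hypothetical quotes: the implied EUR/JPY cross is 1.10 * 150 = 165,
# while the quoted cross is 164.
gross = triangular_round_trip(1.10, 150.0, 164.0)
net = triangular_round_trip(1.10, 150.0, 164.0, cost_per_leg=0.0025)
```

The gross mispricing is about 0.6%, but a cost of 0.25% on each of the three legs is enough to turn the round trip into a loss.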

5.3. Crypto Market

Relative to stock markets, the crypto market has evolved a trading ecosystem that stands markedly apart from traditional financial markets, shaped by its decentralized nature, low transaction barriers, and highly volatile price dynamics. Vergara and Kristjanpoller (2024) demonstrate that integrating cointegration tests with Deep Reinforcement Learning (DRL) into a statistical arbitrage framework can substantially mitigate the impact of extreme volatility in cryptocurrency markets. DRL captures timely trend shifts through dynamic interaction with market conditions, whereas cointegration analysis anchors the intrinsic long-term relationships among assets; together they support real-time portfolio adjustment that continues to capture pricing discrepancies even under severe price turbulence. As a fast-developing emerging market, cryptocurrency trading is characterized by underdeveloped operational mechanisms, fragmented participation, and unpredictable fluctuations. In brief, the crypto market is at an early stage and still has a long way to go, but it holds considerable potential for future statistical arbitrage given its unique market characteristics and growing data availability.

5.4. Derivative Market

The derivative market stands out as the most challenging to enter due to its complex product design and diverse pricing mechanisms. As the intersection of almost all financial markets, its distinct branches naturally require entirely different strategies. Two representative types of derivative statistical arbitrage are discussed below. First, for ESG derivatives arbitrage, Kanamura’s (2025) research on sustainability arbitrage pricing reveals that the price boundaries for ESG derivatives are more stringent than those for traditional derivatives. Arbitrage strategies constructed under these constraints may potentially double the Sharpe Ratio. Furthermore, ESG derivative arbitrage strategies exhibit a stronger positive correlation with macroeconomic trends than other derivatives do. These findings point to the return potential and risk diversification characteristics of ESG derivative arbitrage, and developing statistical arbitrage strategies for ESG derivatives appears promising. Second, for cryptocurrency derivatives, comparative research by Alexander et al. (2024) reveals that the derivatives market for cryptocurrencies is significantly smaller than that of traditional financial derivatives. This is closely tied to the limited capital pool and liquidity stratification within cryptocurrency markets, which constrain the upper limit of statistical arbitrage opportunities in this market. However, with the rapid growth of the cryptocurrency industry in recent years, arbitrage potential in cryptocurrency derivatives is gradually expanding.

5.5. ETF&LETF Market

As an extension of the stock market, ETFs and LETFs exhibit smoother performance and lower risk compared to individual stocks. Due to their strong correlation with index movements, arbitrage strategies involving ETFs and LETFs demonstrate the strongest positive correlation with macroeconomic trends. For ETF arbitrage, research findings indicate that the emergence and disappearance of intraday ETF arbitrage opportunities are closely tied to market liquidity conditions (Marshall et al. 2013). Because these opportunities are driven almost entirely by market liquidity, liquidity measures offer an effective basis for assessing the feasibility and profitability of statistical arbitrage strategies. Pricing discrepancies in LETFs are quite common because of the nonlinear relationship between LETF returns and the returns of their underlying assets, which creates frequent arbitrage opportunities. Studies point out that dynamic semiparametric factor models can continuously identify arbitrage opportunities between LETFs, their underlying assets, and options (Nasekin and Härdle 2019). This has been shown to be practically useful in specific LETF arbitrage scenarios.
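One well-known source of this nonlinearity is that leveraged ETFs typically target a multiple of the index's daily return, so multi-day returns compound differently from the index. The two-day example below is a hypothetical illustration of that effect only and ignores fees and tracking error.

```python
def compound(daily_returns, leverage=1.0):
    """Cumulative return when each daily return is scaled by `leverage`."""
    total = 1.0
    for r in daily_returns:
        total *= 1 + leverage * r
    return total - 1


# Hypothetical index path: up 10% one day, down 10% the next.
index_daily = [0.10, -0.10]
index_total = compound(index_daily)             # 1.1 * 0.9 - 1
letf_total = compound(index_daily, leverage=2)  # 1.2 * 0.8 - 1
```

The index loses 1% over the two days, so a naive 2x multiple would suggest a 2% loss, yet the daily-rebalanced 2x product loses 4%.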

6. Critical Challenges

As one of the most influential quantitative strategies in the current financial market, statistical arbitrage has attracted wide attention from academia and industry owing to its strong performance in recent years. However, its development environment is undergoing profound changes: the rapid evolution of financial market structure, intensifying competition, and the wide application of algorithmic and ML trading systems have brought new challenges to the traditional statistical arbitrage model (Avellaneda and Lee 2010).

In order to maintain the profitability and stability of the strategy in such a dynamic environment, there are three key areas where practical challenges must be addressed.

6.1. Transaction Cost Management

In traditional statistical arbitrage strategies, transaction costs were not regarded as a primary consideration. With the rise of high-frequency trading, however, accounting for them has become increasingly important. Transaction costs, including slippage, impact costs, liquidity constraints, and securities lending fees, may significantly undermine the feasibility of high-frequency or narrow-spread statistical arbitrage strategies (Frino et al. 2017; Piotrowski and Sladkowski 2004). Therefore, transaction costs should be treated as an important factor in the design of future statistical arbitrage strategies.

6.2. Robustness of Statistical Arbitrage: Theoretical Foundation and Practical Deficiencies

6.2.1. Theoretical Positioning: Robustness as a Necessary Condition

To strengthen the theoretical basis, this study argues that robustness is a necessary theoretical condition for valid statistical arbitrage, not a secondary technical feature. This positioning addresses core tensions between theoretical assumptions and market reality:

Regime-Dependent Mean Reversion: Statistical arbitrage relies on mean reversion (Elliott et al. 2005), but this property is not universal—structural shocks can invalidate cointegration relationships (Hong and Stein 2007; Ramos-Requena et al. 2020), making robustness essential to verify the continuity of arbitrage opportunities.
Non-Stationarity and EMH: Financial data is inherently non-stationary (Dickey and Fuller 1979). Robustness bridges weak-form EMH (Fama 1970) and exploitable mispricing, ensuring profits stem from genuine inefficiencies rather than overfitting (White 2000).
Generalization vs. Optimization: ML models face a theoretical trade-off between in-sample optimization and out-of-sample generalization (Bailey and Lopez De Prado 2014); robustness ensures strategies maintain predictive power across unobserved market states.

6.2.2. Practical Challenges

Poor robustness leads to significant gaps between backtested and actual performance, exacerbated by narrow arbitrage spaces. Gatev et al. (2006) found that window length and pairing methods critically impact profitability, while overfitting often renders spurious signal-based strategies ineffective. Murphy and Gebbie (2021) highlight extreme out-of-sample instability, with parameter bias and overfitting undermining effectiveness (Lütkebohmert and Sester 2021). Traditional parametric methods, dependent on strict distributional assumptions, are sensitive to historical noise. Non-parametric techniques (Murphy and Gebbie 2021) mitigate this by extracting structural patterns, reducing “collapse” risks in market fluctuations despite slight prediction accuracy trade-offs.

6.3. Market Friction and Regulatory Constraints

Finally, market friction and regulatory constraints pose challenges. Market friction—including trading costs, liquidity shortages, margin requirements, and short-selling restrictions—compresses arbitrageurs’ profit margins (Hogan et al. 2004), as small spreads can quickly turn profitable strategies uneconomical. Regulatory risks arise from AI/ML models’ “black box” nature, increasing compliance complexity, and potential systemic risks due to limited transparency.

7. Conceptual Proposal: The Human-AI Synergy Framework

In recent years, ML-based statistical arbitrage methods have achieved considerable success in feature extraction, signal generation, and execution automation. However, existing research indicates that fully automated ML systems still have notable shortcomings. This section therefore advances a formal conceptual framework; Table 2 presents how it differs from the reviewed literature. The proposal is not a new predictive algorithm but a governance-oriented synthesis. It bridges AI’s “black-box” risk with human contextual adaptability, formalizing their interaction as a synergistic loop rather than ad hoc intervention. Existing literature on human-AI interaction in statistical arbitrage remains fragmented and limited: most studies treat human input as ad hoc correction for model failures (e.g., Krauss et al. 2017; Flori and Regoli 2021) or as interpretability enhancement (e.g., Zhang et al. 2023), with no structural integration into the strategy lifecycle. Cross-sector studies (Bao and Huang 2021; Chen and Huang 2025) highlight human governance but lack scalability for statistical arbitrage’s real-time and cross-market demands. The proposed framework goes beyond these approaches by institutionalizing collaboration as a governance-centric system rather than heuristic intervention.

7.1. Limitations of Statistical Arbitrage with Machine Learning as the Main Body

According to the existing models, ML-based statistical arbitrage faces five main problems:

First, market non-stationarity and out-of-sample failure are prominent. The structure, liquidity, and participant behavior of financial markets evolve over time, whereas ML models often assume that the data distribution observed during training remains stable, so signals can fail rapidly after deployment (Ning and Lee 2024).

Second, the optimization target generally focuses on expected returns and ignores tail risks and black swan events. A deep learning model may perform well in stable periods yet suffer nonlinear losses and risk amplification in extreme market conditions (Chow et al. 2018; Rockafellar and Uryasev 2002).

Third, limited interpretability makes it difficult for compliance and risk control teams to fully trust the model. For complex neural network trading systems that lack interpretability enhancements, it is often not possible to explain clearly the economic logic behind a trading signal or the contribution of individual strategy factors (Mosqueira-Rey et al. 2023).

Fourth, algorithmic models are vulnerable to data anomalies, latency, and adversarial interference: noisy news, mismatched data, or manipulative behavior can mislead the model and trigger abnormal trades.

Lastly, fully automated execution systems carry institutional risks. When market liquidity plummets or volatility spikes suddenly, automated trading programs may further amplify the market shock in the absence of human intervention, forming so-called “systemic risk at the algorithm level”.

Notably, these limitations are not unique to statistical arbitrage but represent common flaws of technology-driven models across financial sectors. In banking, banks adopting the Expected Credit Loss (ECL) model relied heavily on historical data to predict risks during the COVID-19 crisis, leading to significant credit contraction and exacerbating procyclical fluctuations (Chen and Huang 2025). Similarly, in fintech shadow banking, algorithm-based credit expansion during the pandemic failed to anticipate extreme market risks, which may lead to a surge in default rates; borrowers’ preference to repay traditional bank loans first further highlighted the vulnerability of purely algorithm-driven financial tools (Bao and Huang 2021). These cross-sector empirical findings confirm that over-reliance on technical models, regardless of the application scenario, tends to underestimate unexpected risks in crisis situations, which reinforces the urgency of addressing these limitations in ML-driven statistical arbitrage. In short, across financial fields, purely technical models struggle to cope with extreme market risks. For statistical arbitrage, this implies that fully automated ML systems are insufficient on their own, and that many of these problems can be mitigated by complementing them with human judgment.

7.2. Compensatory Advantages and Irreplaceability of Human Wisdom

Against this background, human experts bring capabilities that AI cannot currently replace at several key points: (1) Macro judgment and institutional experience: humans can quickly identify the potential impact of policies, geopolitical events, or regulatory changes on market structure and make overall judgments across markets and assets. (2) Contextual understanding and commonsense reasoning: an understanding of news text, company announcements, and industry background enables humans to distinguish “false noise” from “substantial information”, preventing the model from triggering wrong signals because of semantically misleading information. (3) Risk control and rule setting: humans can impose hard constraints on strategies, including single-position caps, daily VaR/CVaR limits, and maximum drawdown thresholds, to prevent the model from becoming overexposed to risk in pursuit of profit. (4) Black swan event intervention: when extreme fluctuations or algorithmic anomalies occur, humans can immediately trigger the “emergency stop mechanism” and carry out phased position closing, execution-parameter adjustments, and other operations. (5) Explanatory review and economic rationality test: human experts can judge whether a signal has economic logic based on feature contribution or factor importance analysis (such as SHAP values) of the model output, guarding against overfitting.

It can be seen that humans provide compensatory judgment that is currently difficult for ML to replicate in risk assessment, abnormal-situation intervention, and value-rational supervision (Yuan et al. 2024; Mosqueira-Rey et al. 2023). Wang et al. (2025) demonstrate that, on average, the creativity performance of humans and LLMs is similar, but the distribution extremes differ significantly: human creativity is more variable and holds a clear advantage in highly original output. This reflects the relatively irreplaceable role of humans in dealing with risks at present.

7.3. The Design of the Statistical Arbitrage System of Human-AI Synergy

Based on the “AI-led, human-assisted” concept, this study proposes a human-AI collaborative statistical arbitrage framework derived from recurring literature themes. The system builds on the synthesis of prior studies and seeks to balance efficiency, stability, and regulatory compliance by combining algorithmic automation with a human supervision mechanism. The complete structure of this system is shown in Figure 2.

First, the AI-led module: In existing human-in-the-loop trading systems discussed in the literature, AI components are typically positioned as the primary engines for signal generation and execution optimization, while human oversight is emphasized as a control layer rather than a substitute for algorithmic decision-making. Through well-designed multi-factor feature engineering, nonlinear modeling, and risk-sensitive RL, the model can, under appropriate conditions, capture mean-reversion relationships in multi-market and multi-frequency data. In the training process, CVaR constraints are embedded to limit tail losses while maximizing returns (Chow et al. 2018; Rockafellar and Uryasev 2002). This design breaks from existing concepts where AI and human roles are siloed: unlike studies that relegate humans to post hoc adjustments (Lütkebohmert and Sester 2021), the framework positions AI and human oversight as mutually reinforcing, operating synchronously across the strategy lifecycle.

Second, the human supervision and risk correction module: Human experts mainly supervise and intervene at the key nodes of model operation. For example: (1) Risk threshold setting: the risk control team regularly adjusts hard limits on indicators such as position size, volatility, and VaR. (2) Manual signal verification: high-confidence but high-risk trading signals are manually reviewed and approved. (3) Emergency stop mechanism: when the system detects abnormal market fluctuations or a liquidity crisis, it automatically triggers the manual intervention procedure. (4) Regular explanatory review: human experts analyze the model’s signal logic, factor contributions, and economic rationality to prevent the algorithm from becoming a black box. The design of these human supervision nodes draws on proven risk governance experiences from other financial sectors. Chen and Huang (2025) found that regulatory intervention and manual risk calibration effectively mitigated the procyclical risks of ECL models in banks, which directly informs the “risk threshold setting” and “VaR/CVaR rigid constraints” in our framework: by incorporating human judgment on macroeconomic trends and crisis characteristics, we avoid the over-reliance on historical data that plagues purely technical models. Meanwhile, Bao and Huang (2021) revealed that the lack of manual review of high-risk signals led to concentrated default risks in fintech credit, which justifies the necessity of our “manual signal verification” module.
Specifically, for arbitrage signals generated under extreme market conditions (e.g., pandemic-induced volatility), human experts can leverage contextual understanding and institutional experience to distinguish between transient noise and substantive pricing deviations, preventing abnormal transactions triggered by algorithmic misjudgment—when supported by sufficient domain expertise and timely information, this human intervention can complement the algorithm’s weakness in processing unstructured, nonquantitative information (e.g., policy shocks, geopolitical events) highlighted in both studies. Compared to existing frameworks that limit human involvement to interpretability checks (Mosqueira-Rey et al. 2023), this module expands human expertise to proactive risk calibration and abnormal scenario preemption—addressing tail-risk and regime instability gaps unaddressed by purely statistical adjustments.
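The supervision nodes described above can be pictured as a routing step placed between signal generation and execution. The following is a minimal sketch under stated assumptions: the class `RiskLimits`, the function `route_signal`, the four outcome labels, and all threshold values are hypothetical and are not specified in the framework itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskLimits:
    """Hard limits set and periodically revised by the risk control team."""
    max_position: float      # cap on absolute position size
    var_review_level: float  # VaR above this level goes to a human reviewer
    var_hard_limit: float    # VaR above this level is rejected outright
    max_drawdown: float      # drawdown that triggers the emergency stop


def route_signal(position, var_estimate, drawdown, limits):
    """Decide how a model-generated trade proposal is handled."""
    if drawdown >= limits.max_drawdown:
        return "emergency_stop"  # hand control to humans immediately
    if abs(position) > limits.max_position or var_estimate > limits.var_hard_limit:
        return "reject"          # violates a hard constraint
    if var_estimate > limits.var_review_level:
        return "manual_review"   # high-risk signal needs human approval
    return "auto_execute"


# Illustrative limits only; real values would come from the risk team.
limits = RiskLimits(max_position=1.0, var_review_level=0.02,
                    var_hard_limit=0.05, max_drawdown=0.10)
print(route_signal(position=0.5, var_estimate=0.03, drawdown=0.01, limits=limits))
```

Keeping the limits in an immutable object separate from the model makes each human adjustment an explicit, auditable change.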

Lastly, the framework includes a closed feedback loop with continuous learning: after each review of manual interventions and abnormal events, the system reincorporates the relevant records into the training sample for model relearning and parameter recalibration, forming a continuous-improvement loop of “AI-led → human correction → AI relearning”, which is intended to improve the robustness of the strategy over time when human corrections are based on sound judgment. This closed-loop mechanism differs from static collaboration models (e.g., Han et al. 2023) in that human corrections are systematically reincorporated into model training, which is intended to provide adaptive resilience beyond one-way human-AI interaction.
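As a rough illustration of this feedback loop, reviewed interventions can be stored as labeled records and appended to the training sample before the next recalibration. The record layout and function name below are hypothetical, and the retraining step itself is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    """One reviewed case in which a human overrode or confirmed the model."""
    features: tuple      # model inputs at the time of the signal
    model_signal: int    # e.g. +1 long spread, -1 short spread, 0 flat
    human_decision: int  # decision after manual review
    note: str            # reviewer's rationale, kept for later audits


def merge_interventions(training_set, interventions):
    """Return a new training set with reviewed cases labeled by the human decision."""
    reviewed = [(iv.features, iv.human_decision) for iv in interventions]
    return list(training_set) + reviewed


training_set = [((0.8, 1.2), 1), ((-0.5, 0.9), -1)]  # made-up (features, label) pairs
log = [Intervention((2.9, 4.0), model_signal=1, human_decision=0,
                    note="spread driven by a policy announcement")]
updated = merge_interventions(training_set, log)
print(len(updated))
```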

Theoretically, this framework resolves the aforementioned robustness tensions by institutionalizing complementary governance. Human oversight addresses the regime instability and interpretability gaps by enforcing economic rationale and macro context, while AI maintains predictive efficiency. This structure transforms robustness from a passive statistical property into an active theoretical construct—ensuring the strategy adheres to the core principles of mean reversion while adapting to structural market changes.

8. Conclusions

Statistical arbitrage has evolved from predominantly linear, econometric formulations toward increasingly data-driven and algorithmic paradigms. While ML techniques have substantially expanded modeling capabilities, the literature reviewed in this study highlights persistent challenges related to model instability, limited interpretability, and the gap between algorithmic optimization and real-world decision-making constraints.

8.1. Summary

Rather than proposing another modeling approach, this paper contributes by reorganizing the statistical arbitrage literature from a problem-oriented perspective. Through a structured review, it identifies recurring challenges that cut across asset classes and methodological paradigms and shows that many of these issues are not purely technical, but organizational and governance-related in nature.

Building on the structured synthesis of existing literature, the paper proposes a conceptual AI–led, human-in-the-loop statistical arbitrage framework. The framework clarifies how algorithmic efficiency and human judgment can be systematically combined, positioning ML as the primary engine for signal generation while reserving defined roles for human oversight in risk calibration, interpretation, and exceptional market conditions. This contribution is intended as an analytical blueprint rather than an empirically validated trading system.

Furthermore, this framework resonates with the core findings of prominent studies in banking and fintech (Bao and Huang 2021; Chen and Huang 2025). Both studies emphasize that technical models (ECL, fintech algorithms) require complementary human governance to balance efficiency and stability, and our work advances this line of research by systematically constructing a scalable, operational human-AI collaboration mechanism applicable to quantitative trading. Instead of focusing on single-sector model optimization, this paper integrates cross-sector empirical evidence to propose a universal governance logic for technology-driven financial tools. This extends the contribution of the study.

Overall, the reviewed literature suggests that future developments in statistical arbitrage are not likely to involve the replacement of classical approaches or human judgment, but rather their structured integration. By combining the interpretability of econometric foundations with the nonlinear modeling capabilities of ML models, the proposed framework offers a coherent perspective for future research on statistical arbitrage system design. The conceptual contribution of this framework lies in reframing insights already present in the literature by interpreting human intervention not as a discretionary override, but as an institutionalized governance layer embedded within ML-driven statistical arbitrage systems.

8.2. Limitations

This study is primarily conceptual in nature and abstracts from empirical implementation in order to focus on conceptual integration and governance design. The proposed human–AI collaborative framework is intended as an analytical and organizational synthesis of existing literature rather than a validated operational system.

As a result, detailed experimental design, parameter calibration, and performance evaluation across asset classes and market conditions are left for future research. Empirical studies that implement and test this framework under realistic trading environments would be a valuable extension of the present work.

Funding

This research received no external funding.

Data Availability Statement

The data are not available due to privacy concerns. The data presented in this study are available on request from the corresponding author.

Acknowledgments

I hereby express my sincere gratitude to all authors of the literature cited in this thesis, whose rigorous research findings and profound academic insights have laid a solid theoretical foundation for the completion of this study.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A

Based on a synthesis of previous literature, in this section, we will present the mathematical formulas and methodologies related to machine learning and econometrics that were mentioned earlier in the paper.

Median Absolute Deviation (MAD):

MAD = \mathrm{median}(|x_{i} - \mathrm{median}(x)|)

$x_{i}$ : The i-th observation in the dataset (e.g., asset prices, returns);
$\mathrm{median}(x)$ : The median of the dataset, i.e., the middle value when all observations are sorted.

The smaller the MAD value, the lower the degree of data dispersion.
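A minimal sketch of the MAD calculation using only the Python standard library; the sample returns are made up for illustration.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median(|x_i - median(x)|)."""
    center = median(values)
    return median(abs(x - center) for x in values)


returns = [0.01, -0.02, 0.015, 0.00, 0.30]  # made-up returns; 0.30 is an outlier
print(mad(returns))
```

Because both steps use the median, the single large value in `returns` has little effect on the result, which is the usual reason for preferring MAD when flagging outliers.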

The Z-score standardization:

z_{i} = \frac{x_{i} - μ}{σ}

$z_{i}$ : The standardized value of the i-th observation;
$x_{i}$ : The original value of the i-th observation (e.g., factor values, asset returns);
$μ$ : The population mean of the dataset;
$σ$ : The population standard deviation of the dataset.
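The standardization can be sketched as follows, using the population standard deviation as in the definition above; the input values are illustrative.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Standardize values with the population mean and standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mu) / sigma for x in values]


factor_values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data: mean 5, std 2
z = z_scores(factor_values)
print(z)
```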

Engle-Granger two-step method:

Step 1: Dual-asset cointegration regression, with the basic formula as follows:

y_{t} = α + β x_{t} + ε_{t}

Step 2: Residual stationarity test (ADF test), applied to the estimated residual series from Step 1, denoted $μ_{t}$. Its basic formula is:

Δ μ_{t} = γ μ_{t - 1} + \sum_{i = 1}^{p} δ_{i} Δ μ_{t - i} + ε_{t}

To examine the cointegration relationship among two or more assets, the Johansen maximum likelihood method (Johansen 1988) is typically employed. Its fundamental VAR model is:

Δ x_{t} = Γ_{1} Δ x_{t - 1} + \dots + Γ_{k - 1} Δ x_{t - k + 1} + Π x_{t - 1} + ε_{t}

$y_{t}$ : The dependent variable time series (e.g., price of Asset A);
$x_{t}$ : The independent variable time series (e.g., price of Asset B), assumed to have a long-term linear relationship with $y_{t}$ ;
$α$ : The regression intercept term;
$β$ : The regression slope coefficient, measuring the marginal effect of $x_{t}$ on $y_{t}$ ;
$γ$ : The adjustment coefficient, indicating the speed at which the residual reverts to zero (stationarity requires $γ$ < 0);
$p$ : The number of lagged difference terms for residuals;
$ε_{t}$ : The random error term of the residual regression.
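The two Engle-Granger steps can be sketched in plain Python on simulated data. This is a simplified illustration under stated assumptions: the pair is synthetic, the residual regression uses no lagged difference terms (p = 0), and the resulting t-statistic would still have to be compared against cointegration-specific critical values, which are not computed here.

```python
import random
from statistics import fmean


def ols_simple(x, y):
    """Step 1: return (alpha, beta) for y = alpha + beta * x by least squares."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return my - beta * mx, beta


def df_regression(resid):
    """Step 2 with p = 0: regress the change in the residual on its lagged level."""
    lagged = resid[:-1]
    delta = [b - a for a, b in zip(resid[:-1], resid[1:])]
    sll = sum(l * l for l in lagged)
    gamma = sum(l * d for l, d in zip(lagged, delta)) / sll
    sse = sum((d - gamma * l) ** 2 for l, d in zip(lagged, delta))
    se = (sse / (len(delta) - 1) / sll) ** 0.5
    return gamma, gamma / se  # coefficient and its t-statistic


# Simulated cointegrated pair: x is a random walk, y = 1 + 2x + stationary noise.
random.seed(0)
x = [0.0]
for _ in range(499):
    x.append(x[-1] + random.gauss(0, 1))
y = [1.0 + 2.0 * xi + random.gauss(0, 0.5) for xi in x]

alpha, beta = ols_simple(x, y)
resid = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
gamma, t_stat = df_regression(resid)
print(beta, gamma, t_stat)
```

A clearly negative `gamma` with a large negative t-statistic is consistent with mean-reverting residuals; in practice a tested library implementation with proper lag selection would be preferable.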

Information Coefficient (IC):

IC = \frac{\mathrm{Cov}(f_{t}, r_{t + 1})}{σ(f_{t}) σ(r_{t + 1})}

$Cov (f_{t}, r_{t + 1})$ : Covariance between factor $f_{t}$ and next period’s return $r_{t + 1}$
$f_{t}$ : Factor value (e.g., trading volume) at time t
$r_{t + 1}$ : Asset return in next period

Among these, |IC| > 0.1 and p < 0.05 are commonly interpreted as indicating economically meaningful predictive power.
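A small sketch of the IC calculation, pairing each factor value with the following period's return. The data are made up and constructed so that the factor predicts the next return perfectly; the significance test (p-value) is not included.

```python
from statistics import fmean


def information_coefficient(factor, returns):
    """Correlation between f_t and r_{t+1}; both lists are indexed by t."""
    f = factor[:-1]  # f_t
    r = returns[1:]  # r_{t+1}
    mf, mr = fmean(f), fmean(r)
    cov = sum((a - mf) * (b - mr) for a, b in zip(f, r))
    sf = sum((a - mf) ** 2 for a in f) ** 0.5
    sr = sum((b - mr) ** 2 for b in r) ** 0.5
    return cov / (sf * sr)  # the 1/n factors cancel between numerator and denominator


factor = [1.0, 2.0, 3.0, 4.0, 5.0]         # made-up factor values
returns = [0.00, 0.02, 0.04, 0.06, 0.08]   # made-up returns, linear in the lagged factor
ic = information_coefficient(factor, returns)
print(ic)
```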

Bellman optimality equation:

Q(s, a) = E[r_{t} + γ \max_{a'} Q(s_{t + 1}, a')]

$Q (s, a)$ (Action-value function): The expected long-term cumulative reward an agent can obtain by taking action a in state s and following the optimal policy thereafter.
$s$ (State): The state of the system (or market) at time t—in trading, this could include market volatility, asset spreads, liquidity, etc.
$a$ (Action): The action chosen by the agent in state s—in trading, this might involve position sizing, entering/exiting a trade, or adjusting holdings.
$r_{t}$ (Immediate reward): The immediate reward (or risk-adjusted return) received by the agent after taking action $a_{t}$ in state $s_{t}$ .
$γ$ (Discount factor): A value between 0 and 1 that discounts future rewards, placing less weight on more distant future outcomes.
$a^{'}$ (Next action): Any possible action the agent can take in the next state $s_{t + 1}$ .
where $γ \in (0,1)$ denotes the discount factor. In practical applications, deep neural networks are often employed to approximate the Q-function, and this approach has contributed to the widespread adoption of Deep Q-Network (DQN) architectures in many statistical arbitrage and portfolio optimization studies.
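In tabular form, the Bellman optimality equation is commonly approximated by the Q-learning update, which moves Q(s, a) toward the sampled target r + γ max Q(s', a'). The sketch below shows a single update; the state names, action set, learning rate, and reward are all hypothetical, and a DQN would replace the table with a neural network.

```python
from collections import defaultdict

ACTIONS = ("long_spread", "short_spread", "flat")  # hypothetical action set


def q_update(q, state, action, reward, next_state, gamma=0.95, lr=0.1):
    """One Q-learning step toward the target r + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q[(state, action)] += lr * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)          # unseen (state, action) pairs start at 0
q[("neutral", "flat")] = 1.0    # pretend this value was learned earlier
new_value = q_update(q, "wide", "short_spread", reward=0.5, next_state="neutral")
print(new_value)
```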

Minimized mean square error (MSE):

L = \frac{1}{N} \sum_{t = 1}^{N} {({\hat{y}}_{t} - y_{t})}^{2}

$L$ : Loss (prediction error)
$N$ : Total number of data points
${\hat{y}}_{t}$ : Predicted value
$y_{t}$ : Actual observed value

Here, $y_{t}$ is the actual observed value. By continuously optimizing the model parameters $θ$, the model learns potential arbitrage signals in historical data and uses them for strategy execution on out-of-sample data.
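To make the role of the loss concrete, the sketch below computes the MSE and fits a one-parameter linear model by gradient descent. The model form (prediction equal to θ times x), the learning rate, and the data are illustrative assumptions, not the architecture of any reviewed study.

```python
def mse(predicted, actual):
    """Mean squared error between predictions and observations."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def fit_theta(x, y, lr=0.05, steps=200):
    """Minimize the MSE of the toy model y_hat = theta * x by gradient descent."""
    theta = 0.0
    n = len(x)
    for _ in range(steps):
        grad = 2.0 / n * sum((theta * xi - yi) * xi for xi, yi in zip(x, y))
        theta -= lr * grad
    return theta


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # generated by theta = 2 with no noise
theta = fit_theta(x, y)
print(theta, mse([theta * xi for xi in x], y))
```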

GARCH and SV model:

σ_{t}^{2} = ω + α ε_{t - 1}^{2} + β σ_{t - 1}^{2}

$σ_{t}^{2}$ : Variance (volatility squared) at time t
$ω$ : Constant term
$α$ : Weight of previous period’s error squared ( $ε_{t - 1}^{2}$ )
$β$ : Weight of the previous period’s variance ( $σ_{t - 1}^{2}$ )
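The GARCH(1,1) recursion can be sketched directly from the formula. The parameter values are illustrative rather than estimated, and starting the recursion at the unconditional variance ω / (1 − α − β) is an assumption of this sketch.

```python
def garch_variance_path(residuals, omega, alpha, beta):
    """Conditional variances from sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}."""
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        raise ValueError("need omega > 0, alpha >= 0, beta >= 0, alpha + beta < 1")
    variances = [omega / (1.0 - alpha - beta)]  # start at the unconditional variance
    for eps in residuals:
        variances.append(omega + alpha * eps ** 2 + beta * variances[-1])
    return variances  # one more entry than residuals: the last is a one-step forecast


path = garch_variance_path([2.0, 0.0], omega=0.1, alpha=0.1, beta=0.8)
print(path)
```

A large shock raises the next period's variance, which then decays at rate β when later shocks are small; this is the volatility clustering the model is meant to capture.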

The SV model is often used as a supplement to the GARCH model. When combined into a GARCH-SV model, it retains the GARCH model’s ability to capture volatility clustering while incorporating the SV model’s description of random changes in volatility, enabling the calculation of a more accurate asset volatility $σ_{t}$.

With volatility $σ_{t}$ available, risk can be measured using VaR and CVaR methods.

VaR Method: VaR_{0.95} = -F^{-1}(0.05)

$F^{-1}$ : Inverse of the asset return distribution function

In this method, F is the distribution function of asset returns, and the equation gives the loss level that will not be exceeded with 95% probability (that is, the maximum potential loss at the 95% confidence level).

CVaR Method: CVaR_{α} = -\frac{1}{α} \int_{0}^{α} F^{-1}(u) du

$u$ : Integration variable (covers the extreme loss range)

This method measures the average level of extreme losses that occur when losses exceed VaR, at tail probability $α$ (e.g., $α$ = 0.05 for the 95% confidence level).

In short, the GARCH-SV model calculates more precise volatility, while methods using VaR and CVaR quantify the maximum possible loss of an investment and the average level of loss under extreme conditions.
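An empirical counterpart of the two formulas replaces the distribution function F with the sorted sample of returns. The sketch below uses one simple quantile convention (the worst ⌈αn⌉ observations form the tail); other conventions exist, and the return series is made up.

```python
import math
from statistics import fmean


def historical_var_cvar(returns, alpha=0.05):
    """Empirical VaR and CVaR at tail probability alpha, reported as positive losses."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie in (0, 1)")
    ordered = sorted(returns)                    # worst returns first
    k = max(1, math.ceil(alpha * len(ordered)))  # number of tail observations
    tail = ordered[:k]
    var = -tail[-1]      # minus the empirical alpha-quantile of returns
    cvar = -fmean(tail)  # minus the average return within the tail
    return var, cvar


sample = [-0.10, -0.05, -0.02, 0.0, 0.01, 0.01, 0.02, 0.02, 0.03, 0.04]  # made-up returns
var_80, cvar_80 = historical_var_cvar(sample, alpha=0.2)
print(var_80, cvar_80)
```

By construction the CVaR is at least as large as the VaR at the same level, since it averages the losses at and beyond the VaR threshold.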

Dynamic Control: After obtaining the aforementioned risk quantification results, we directly link the quantified risk to actual positions using a specific formula, providing a clear basis for dynamic position control. This formula is:

\mathrm{Position} = \frac{\mathrm{TargetRisk}}{σ_{t}}

$\mathrm{TargetRisk}$ : Maximum risk an investor accepts
$σ_{t}$ : Current asset volatility

The target risk here essentially refers to the predetermined maximum risk an investor can tolerate. Dividing this by the current asset volatility $σ_{t}$ yields the position size, which represents the actual capital allocation ratio. Using this approach, we can adjust holdings through quantitative strategies based on our risk tolerance (Bollerslev 1986).
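The sizing rule is simple enough to state as a short function. The cap `max_position` is an added assumption, standing in for a human-set hard limit, and is not part of the formula above; the numbers are illustrative.

```python
def position_size(target_risk, sigma_t, max_position=1.0):
    """Volatility-targeted position: target_risk / sigma_t, capped at max_position."""
    if sigma_t <= 0:
        raise ValueError("volatility must be positive")
    return min(target_risk / sigma_t, max_position)


print(position_size(0.10, 0.20))  # calm market: half of capital allocated
print(position_size(0.10, 0.05))  # very low volatility: the cap binds
```

When volatility doubles, the position halves, so the risk carried by the position stays roughly constant.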

References

Alexander, Carol, Xia Chen, Jia Deng, and Tao Wang. 2024. Arbitrage opportunities and efficiency tests in crypto derivatives. Journal of Financial Markets 71: 100930.
Avellaneda, Marco, and Jeong-Hyun Lee. 2010. Statistical arbitrage in the US equities market. Quantitative Finance 10: 761–82.
Bailey, David H., and Marcos Lopez De Prado. 2014. The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality. Journal of Portfolio Management 40: 94–107.
Balladares, Karla, Juan P. Ramos-Requena, Juan E. Trinidad-Segovia, and Miguel A. Sánchez-Granero. 2021. Statistical Arbitrage in Emerging Markets: A Global Test of Efficiency. Mathematics 9: 179.
Bao, Zhengyang, and Difang Huang. 2021. Shadow banking in a crisis: Evidence from FinTech during COVID-19. Journal of Financial and Quantitative Analysis 56: 2320–55.
Bollerslev, Tim. 1986. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31: 307–27.
Caneo, Fernando, and Werner Kristjanpoller. 2021. Improving statistical arbitrage investment strategy: Evidence from Latin American stock markets. International Journal of Finance & Economics 26: 4424–40.
Chen, Chen, and Difang Huang. 2025. A Tale of Two Banks: When Credit Loss Models Meet Economic Crises. Journal of Accounting Research, ahead of print.
Chow, Yinlam, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. 2018. Risk-Constrained Reinforcement Learning with Percentile Risk Criteria. The Journal of Machine Learning Research 18: 6070–6120.
Ciacci, Alberto, Takumi Sueshige, Hideki Takayasu, Kim Christensen, and Misako Takayasu. 2020. The microscopic relationships between triangular arbitrage and cross-currency correlations in a simple agent based model of foreign exchange markets. PLoS ONE 15: e0234709.
Coache, Anthony, Sebastian Jaimungal, and Álvaro Cartea. 2023. Conditionally Elicitable Dynamic Risk Measures for Deep Reinforcement Learning. SIAM Journal on Financial Mathematics 14: 1249–89.
De Moura, Carlos E., Adrian Pizzinga, and José Zubelli. 2016. A pairs trading strategy based on linear state space models and the Kalman filter. Quantitative Finance 16: 1559–73.
Dickey, David A., and Wayne A. Fuller. 1979. Distribution of the Estimators for Autoregressive Time Series with a Unit Root. Journal of the American Statistical Association 74: 427–31.
Do, Binh, and Robert Faff. 2010. Does Simple Pairs Trading Still Work? Financial Analysts Journal 66: 83–95.
Elliott, Robert J., Jan Van Der Hoek, and William P. Malcolm. 2005. Pairs trading. Quantitative Finance 5: 271–76.
Engle, Robert. 2002. Dynamic Conditional Correlation: A Simple Class of Multivariate Generalized Autoregressive Conditional Heteroskedasticity Models. Journal of Business and Economic Statistics 20: 339–50.
Engle, Robert F., and Clive W. J. Granger. 1987. Co-Integration and Error Correction: Representation, Estimation, and Testing. Econometrica 55: 251.
Fama, Eugene F. 1970. Efficient capital markets: A review of theory and empirical work. The Journal of Finance 25: 383–417.
Fischer, Thomas, and Christopher Krauss. 2018. Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research 270: 654–69.
Flori, Andrea, and Daniele Regoli. 2021. Revealing Pairs-trading opportunities with long short-term memory networks. European Journal of Operational Research 295: 772–91.
Frino, Alex, Vito Mollica, Robert I. Webb, and Shunquan Zhang. 2017. The impact of latency sensitive trading on high frequency arbitrage opportunities. Pacific-Basin Finance Journal 45: 91–102.
Gatev, Evan, William N. Goetzmann, and K. Geert Rouwenhorst. 2006. Pairs Trading: Performance of a Relative-Value Arbitrage Rule. Review of Financial Studies 19: 797–827.
Grinold, Richard C., and Ronald N. Kahn. 2000. Active Portfolio Management. Available online: https://cms.dm.uba.ar/Members/maurette/ACF2022/Richard%20Grinold%2C%20Ronald%20Kahn-Active%20Portfolio%20Management_%20A%20Quantitative%20Approach%20for%20Producing%20Superior%20Returns%20and%20Controlling%20Risk-McGraw-Hill%20%281999%29.pdf (accessed on 25 January 2026).
Han, Chulwoo, Zhaodong He, and Alenson J. W. Toh. 2023. Pairs trading via unsupervised learning. European Journal of Operational Research 307: 929–47.
Hogan, Steve, Robert Jarrow, Melvyn Teo, and Mitch Warachka. 2004. Testing market efficiency using statistical arbitrage with applications to momentum and value strategies. Journal of Financial Economics 73: 525–65.
Hong, Harrison, and Jeremy C. Stein. 2007. Disagreement and the Stock Market. Journal of Economic Perspectives 21: 109–28.
Huck, Nicolas. 2019. Large data sets and machine learning: Applications to statistical arbitrage. European Journal of Operational Research 278: 330–42.
Jaimungal, Sebastian, Silvana M. Pesenti, Ye Sheng Wang, and Hariom Tatsat. 2022. Robust Risk-Aware Reinforcement Learning. SIAM Journal on Financial Mathematics 13: 213–26.
Jarrow, Robert, Melvyn Teo, Yiu K. Tse, and Mitch Warachka. 2012. An improved test for statistical arbitrage. Journal of Financial Markets 15: 47–80.
Jegadeesh, Narasimhan, and Sheridan Titman. 1993. Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency. The Journal of Finance 48: 65–91.
Johansen, Søren. 1988. Statistical analysis of cointegration vectors. Journal of Economic Dynamics and Control 12: 231–54.
Kanamura, Takashi. 2025. Sustainability arbitrage pricing of ESG derivatives. International Review of Financial Analysis 104: 104177.
Krauss, Christopher, Xuan A. Do, and Nicolas Huck. 2017. Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. European Journal of Operational Research 259: 689–702.
Lütkebohmert, Eva, and Julian Sester. 2021. Robust statistical arbitrage strategies. Quantitative Finance 21: 379–402.
Marshall, Ben R., Nhut H. Nguyen, and Nuttawat Visaltanachoti. 2013. ETF arbitrage: Intraday evidence. Journal of Banking & Finance 37: 3486–98.
Menkveld, Albert J. 2013. High frequency trading and the new market makers. Journal of Financial Markets 16: 712–40.
Mosqueira-Rey, Eduardo, Elena Hernández-Pereira, David Alonso-Ríos, José Bobes-Bascarán, and Ángel Fernández-Leal. 2023. Human-in-the-loop Machine Learning: A State of the Art. Artificial Intelligence Review 56: 3005–54.
Murphy, Nicholas, and Tom Gebbie. 2021. Learning the dynamics of technical trading strategies. Quantitative Finance 21: 1325–49.
Nasekin, Sergey, and Wolfgang K. Härdle. 2019. Model-driven statistical arbitrage on LETF option markets. Quantitative Finance 19: 1817–37.
Ning, Bo, and Kiseop Lee. 2024. Advanced Statistical Arbitrage with Reinforcement Learning. arXiv:2403.12180.
Piotrowski, Edward W., and Jan Sladkowski. 2004. Arbitrage risk induced by transaction costs. Physica A: Statistical Mechanics and Its Applications 331: 233–39.
Ramos-Requena, José Pedro, Juan E. Trinidad-Segovia, and Miguel Á. Sánchez-Granero. 2020. Some Notes on the Formation of a Pair in Pairs Trading. Mathematics 8: 348.
Rockafellar, R. Tyrrell, and Stanislav Uryasev. 2002. Conditional Value-at-Risk for General Loss Distributions. Journal of Banking & Finance 26: 1443–71.
Rousseeuw, Peter J., and Christophe Croux. 1993. Alternatives to the Median Absolute Deviation. Journal of the American Statistical Association 88: 1273–83.
Serrano, Camilo, and Martin Hoesli. 2012. Fractional Cointegration Analysis of Securitized Real Estate. The Journal of Real Estate Finance and Economics 44: 319–38.
Stephenson, Jeff, Bruce Vanstone, and Tobias Hahn. 2021. A Unifying Model for Statistical Arbitrage: Model Assumptions and Empirical Failure. Computational Economics 58: 943–64.
Troyanskaya, Olga, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17: 520–25.
Vergara, Gabriel, and Werner Kristjanpoller. 2024. Deep reinforcement learning applied to statistical arbitrage investment strategy on cryptomarket. Applied Soft Computing 153: 111255.
Wang, Dawei, Difang Huang, Haipeng Shen, and Brian Uzzi. 2025. A large-scale comparison of divergent creativity in humans and large language models. Nature Human Behaviour, 1–10.
White, Halbert. 2000. A reality check for data snooping. Econometrica 68: 1097–126.
Yuan, Hao, Saizhuo Wang, and Jian Guo. 2024. Alpha-GPT 2.0: Human-in-the-Loop AI for Quantitative Investment. arXiv:2402.09746.
Zhang, Weiqian, Songsong Li, Zhichang Zhen Guo, and Yizhe Yang. 2023. A hybrid forecasting model based on deep learning feature extraction and statistical arbitrage methods for stock trading strategies. Journal of Forecasting 42: 1729–49.

Figure 1. The evolution and integration of statistical arbitrage methods.

Figure 2. The Structure of the Human-AI framework.

Table 1. Methodological Quality Assessment Criteria.

Dimension	Score = 0	Score = 1	Score = 2
Data (D)	Short sample	Moderate sample	Long multi-regime
Model (M)	In-sample only	Partial OOS	Full OOS + CV
Robustness (R)	None	Limited	Extensive
Performance (P)	Return only	Basic risk	Comprehensive
Interpretability (I)	Black-box	Partial	Clear rationale
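
As a reading aid, the rubric in Table 1 can be encoded as a small record type. The sketch below is illustrative only: the class name `QualityScore`, the range check, and in particular the unweighted `total` are assumptions made here, since the table itself lists only the per-dimension scores and does not define an aggregate.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QualityScore:
    """Rubric scores for one study; each field takes 0, 1, or 2 as in Table 1.

    Hypothetical helper for illustration, not part of the article.
    """
    D: int  # Data: 0 = short sample, 1 = moderate sample, 2 = long multi-regime
    M: int  # Model: 0 = in-sample only, 1 = partial OOS, 2 = full OOS + CV
    R: int  # Robustness: 0 = none, 1 = limited, 2 = extensive
    P: int  # Performance: 0 = return only, 1 = basic risk, 2 = comprehensive
    I: int  # Interpretability: 0 = black-box, 1 = partial, 2 = clear rationale

    def __post_init__(self):
        # Reject any score outside the three levels the table defines.
        for f in fields(self):
            value = getattr(self, f.name)
            if value not in (0, 1, 2):
                raise ValueError(f"{f.name} must be 0, 1, or 2, got {value!r}")

    def total(self) -> int:
        # Assumption: an unweighted sum across the five dimensions (range 0-10).
        return sum(getattr(self, f.name) for f in fields(self))


# Made-up scores for a hypothetical study, purely to show usage.
example = QualityScore(D=2, M=1, R=0, P=2, I=1)
print(example.total())
```

A frozen dataclass keeps each assessment immutable once recorded, and validating in `__post_init__` means an out-of-range score fails at construction rather than silently skewing a later comparison.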

Table 2. Structural Comparison: Literature Review Findings vs. Proposed Synergy Framework.

Comparison Dimension	Literature Review Analysis (Section 4 and Section 5)	Proposed Synergy Framework (Section 7)
Research Scope	Investigating existing ML models and their historical performance.	Developing a conceptual governance layer for ML systems.
Core Methodology	Synthesis of econometric, ML, and RL-based strategies.	Integration of ML signal modeling with structured human oversight.
Operational Focus	Signal accuracy and profit maximization.	Risk calibration and system robustness.
Handling of Anomalies	Reliance on model retraining and statistical adjustments.	Utilization of discretionary intervention and expert judgment.
Role of Human Input	Identified as a fragmented or secondary component.	Formalized as a central “Human-in-the-loop” module.
