Symbolic Early Stopping in Neural Sequence Models via Mapper-Induced Symbolic Dynamics

Tomilov, Ivan; Zamotaev, Rodion; Gusarova, Natalia; Vatian, Aleksandra

doi:10.3390/technologies14060339

Open Access Article

School of Translational Information Technologies, ITMO University, 197101 St. Petersburg, Russia

* Authors to whom correspondence should be addressed.

Technologies 2026, 14(6), 339; https://doi.org/10.3390/technologies14060339

Submission received: 26 March 2026 / Revised: 19 May 2026 / Accepted: 27 May 2026 / Published: 3 June 2026

(This article belongs to the Section Information and Communication Technologies)

Abstract

Early stopping is a standard form of implicit regularization in neural sequence models, but criteria based solely on validation loss can become unstable or weakly informative in noisy, non-stationary, or weakly separated regimes. We propose Symbolic Early Stopping (SES), a representation-aware hybrid stopping criterion that monitors the evolution of validation hidden-state organization during training. At each epoch, SES constructs a Mapper-based symbolic abstraction of hidden representations extracted from a fixed monitored layer, transforms latent trajectories into symbol sequences, and summarizes them through a compact set of symbolic–dynamic descriptors capturing sequential complexity, transition uncertainty, and geometric dispersion. These descriptors are aggregated into a single symbolic stability score, which is combined with validation-loss monitoring to detect convergence of the learned representation. We evaluate SES on recurrent, bidirectional recurrent, and encoder-only Transformer architectures across multiple time-series regimes with different levels of structural regularity and noise. The results indicate that SES frequently terminates training substantially earlier than conservative loss-based baselines while preserving a competitive quality–efficiency trade-off relative to oracle validation-based stopping. Robustness experiments under additive input noise show that the symbolic monitoring signal remains informative under moderate perturbations, although its advantage is not uniform across all datasets and model classes. A layer-wise analysis further suggests that useful stopping signals may emerge before the final validation curve fully stabilizes, reflecting earlier organization of latent representations. Overall, SES provides an interpretable and computationally tractable framework for representation-level early stopping in neural sequence modeling.

Keywords:

early stopping; neural sequence models; representation monitoring; symbolic dynamics; topological data analysis; Mapper; time-series forecasting

1. Introduction

Early stopping (ES) is one of the most widely used forms of implicit regularization in deep learning. By halting optimization before the model enters a regime of excessive memorization, ES can improve generalization while also reducing training cost [1,2,3]. ES is usually implemented through validation-loss monitoring, sometimes augmented with patience, smoothing, or trend heuristics.

However, output-space signals are not always informative enough for stopping decisions. In many training regimes, monitoring validation metrics alone may be insufficient, because such criteria often cannot distinguish performance fluctuations driven by inherent data noise from those caused by incomplete or unstable model adaptation. In noisy, small, non-stationary, or weakly separated datasets, validation loss may plateau late, fluctuate strongly, or react only after the internal representation has already stabilized.

This limitation is especially relevant for neural sequence models (NSMs), i.e., architectures designed to capture dependencies across positions or time steps [4]. Although the broader NSM family now includes both foundation models and state-space models, in this work we focus on conventional sequence architectures and empirically study recurrent, bidirectional recurrent, and encoder-only Transformer models. Their hidden states evolve along structured trajectories in latent space, and their training dynamics are often sensitive to initialization, noise, and architectural scale.

RNN-type architectures are known to suffer from optimization instabilities such as vanishing or exploding gradients, whereas highly parameterized Transformers may exhibit substantial memorization capacity and overfitting tendencies [5,6,7]. These observations motivate representation-aware stopping: if the latent organization of hidden trajectories stabilizes before the validation curve fully settles, then the evolution of the internal state can serve as an auxiliary stopping signal.

Topological data analysis (TDA) offers one family of tools for describing the geometry of neural representations, while symbolic dynamics (SD) provides a complementary description of trajectory organization through finite symbol strings and complexity measures. The existing TDA-based analyses of neural networks (NN) rely largely on persistent-homology-type summaries, which can become costly and noise-sensitive in high-dimensional settings. Symbolic-dynamics studies highlight the prospects of Lempel–Ziv-type descriptors and entropy measures, but leave unaddressed the question of how to obtain a stable and interpretable discretization of latent trajectories for early stopping.

We therefore propose a representation-aware ES method for neural sequence models that combines symbolic dynamics and topological data analysis. We refer to this method as Symbolic Early Stopping (SES). The central idea is to track not only the validation loss, but also the large-scale organization of hidden-state trajectories during training. Figure 1 below illustrates the structural primitive used by SES—a Mapper-style graph constructed over hidden-state samples—whereas the full SES processing pipeline (epoch-wise hidden-state extraction, Mapper construction, symbolization, descriptor computation, aggregation, and the hybrid stopping decision) is presented later in Figure 2.

After each training epoch, we extracted hidden representations from a fixed monitored layer of the neural sequence model. In the implementation used throughout the main SES pipeline, this monitored representation was the final hidden layer. For recurrent models, we used the last hidden state of the final recurrent layer; for bidirectional recurrent models, we used the concatenation of the final forward and backward hidden states; and for Transformers, we used the output of the final encoder stack, followed by mean pooling over the sequence dimension to obtain a fixed-dimensional embedding. In our experiments, the recurrent baselines were single-layer LSTMs, so this monitored representation coincided with the only recurrent layer.

The proposed method monitors the internal state of the model directly rather than inferring convergence only from validation curves. It is, therefore, especially relevant for sequence models and noisy forecasting settings, where loss-only signals may be unstable or delayed relative to latent-structure stabilization.

We do not claim universal dominance of SES over all stopping rules. Rather, we position SES as an interpretable representation-aware criterion that often offers an attractive trade-off between stopping quality, compute savings, and robustness of latent-state monitoring.

The aim of this work is to develop and evaluate a representation-aware early-stopping rule for neural sequence models. Instead of relying only on validation loss, SES monitors whether hidden-state trajectories become structurally stable during training. The method is designed to remain computationally practical at the epoch level and to keep the stopping signal interpretable through symbolic-dynamics descriptors.

SES is mainly intended for settings where validation curves are noisy, delayed, or weakly separated, such as small or noisy time-series datasets, regime-switching signals, and sequence models trained on data with both periodic and stochastic components. In our experiments, the most stable gains are observed on structured ETT-type and quasi-periodic data. On strongly chaotic or weakly structured datasets, such as Lorenz, Bitcoin, and EEG, SES remains useful but should be interpreted as a quality–efficiency trade-off rather than a universally superior stopping rule.

The novelty of SES is that it combines three ideas in one stopping criterion. Unlike Patience and Slope, it does not use only the output-space validation curve. Unlike SVCCA-style methods, it does not only compare hidden representations pointwise across epochs. Unlike persistent-homology-based approaches, it avoids expensive full topological summaries. Instead, SES uses Mapper to construct a coarse partition of the hidden-state space, converts validation trajectories into symbolic sequences, and computes compact symbolic-dynamics descriptors. These descriptors are then combined with rank aggregation, liveness filtering, and a validation-loss guard to obtain a practical representation-aware stopping rule.

The main contributions of this paper are as follows:

− We introduce SES, a hybrid representation-aware stopping criterion for neural sequence models based on Mapper-induced symbolization of hidden-state trajectories obtained during validation.
− We use Mapper as a lightweight topological scaffold for hidden-state organization, reducing the computational burden of full simplicial-complex constructions while preserving the large-scale structure relevant for stopping decisions.
− We define validation symbolization with respect to the phase-space partition inferred directly from validation hidden states at each epoch, which keeps the method consistent with the implemented algorithmic pipeline.
− We aggregate several symbolic and entropy-geometric descriptors into a single score to improve robustness with respect to short-term fluctuations of individual metrics.
− We study the variability of stopping times across individual SD metrics and their ensembles, the transferability of Mapper hyperparameters, and the computational overhead of representation-aware monitoring.
− We empirically benchmark SES against representative loss-based, slope-based, correlation-based, and activation-similarity baselines (Patience, Slope, CDSC, and SVCCA) on datasets with different dynamical regimes and noise levels, and characterize the regimes in which SES yields a favorable quality–efficiency trade-off rather than universal dominance.

The rest of the paper is organized as follows. Section 2 reviews related work, Section 3 presents the SES methodology and the experimental protocol, Section 4 discusses the empirical results, and Section 5 concludes.

2. Related Works

Neural sequence models as dynamical systems. Recurrent and attention-based sequence architectures have been consistently viewed as dynamical systems whose hidden states evolve along structured trajectories in latent space [8,9,10]. Although exact nonlinear descriptions remain limited, a common pattern is that, during training, hidden-state trajectories often evolve from high-variance or chaotic regimes toward more organized low-dimensional structures such as attractors, manifolds, or metastable regions [8,9,11]. This motivates the trend toward representation-aware early stopping. If the internal geometry of hidden trajectories becomes progressively more stable as useful learning saturates, then the changes in latent organization may provide an informative stopping signal complementary to the validation loss. Studies of recurrent and Transformer-style dynamics [8,9,11,12] have repeatedly suggested that hidden trajectories move from disorder to more structured regimes during training. This makes the evolution of internal geometry a plausible indicator of when the model has extracted most of the task-relevant structure from the data.

Topological criteria for early stopping. TDA-based metrics have been used to analyze neural representations and, in several studies, to define stopping criteria via persistent homology, neural persistence, or related complexity measures [13,14,15]. These approaches are conceptually appealing but can become computationally expensive and noise-sensitive in high-dimensional settings, especially when representations must be monitored epoch by epoch [16,17]. Mapper provides a useful alternative because it captures coarse topological organization without requiring the construction of full simplicial complexes [18,19]. On the other hand, Mapper is not a universal quantitative invariant: its output depends on the cover, filter, and clustering settings; hence, any stopping rule built on it should be interpreted as a robust structural proxy rather than as an accurate topological measurement [18,20].

Symbolic-dynamics criteria. Symbolic dynamics provides a complementary language for describing complex trajectories through finite symbol strings and descriptors based on information theory, such as Lempel–Ziv complexity, entropy rates, and ordinal-pattern statistics [21,22]. These descriptors appear to be appropriate for early stopping because they capture the transition from exploratory, irregular latent dynamics toward more repetitive and organized regimes. The main practical challenge is discretization: symbolic metrics are informative only if the partition of the latent space is stable and meaningful. Our approach addresses the above by constructing an epoch-specific Mapper partition directly on validation hidden states and then symbolizing the corresponding validation trajectories with respect to the partition. Previous research has also linked complexity-based indicators, including Lempel–Ziv type measures, to representation organization and overfitting-related transitions in neural systems [23,24].

Competing stopping rules. Classical patience-based criteria [1,2] were further complemented by trend-based rules, correlation-based rules, activation-similarity methods including SVCCA [25], and by ensembles of online indicators [26]. These more recent methods provide augmented views of convergence; however, they either rely on output-space behavior only or compare pointwise representations without explicitly tracking the organization of latent trajectories. The proposed SES is positioned between these lines of work: it remains validation-aware and computationally practical, while monitoring a structured representation signal rather than only loss values or pointwise activation similarity. Notably, the training curves may also exhibit double-descent-like behavior [27], which makes any single stopping indicator potentially unreliable. For this reason, we do not claim universal superiority over all baselines; instead, we investigate when a symbolic-topological view of representation dynamics yields a useful and interpretable stopping signal. Gradient-based component-wise stopping rules and layer-freezing strategies provide another relevant comparison point [28]. They can detect local changes in optimization dynamics, but they are not designed to represent the global geometry of hidden-state trajectories. More broadly, activation-comparison methods such as SVCCA offer useful information about representational stability, but direct pointwise comparison does not explicitly capture the connectivity and regime structure of latent trajectories [25]. In that sense, SES emphasizes structural organization rather than only pairwise similarity.

3. Methodology

3.1. Method Overview and Interpretation

A conceptual diagram of the proposed SES method is shown in Figure 2. The core SES pipeline includes: (1) training an RNN, BiRNN, or Transformer model; (2) extracting validation hidden states after each epoch; (3) building a Mapper graph over the resulting validation embedding cloud; (4) symbolizing the corresponding validation trajectories with respect to Mapper nodes; (5) computing a panel of symbolic and entropy-geometric descriptors; and (6) aggregating these descriptors into a scalar stopping signal.

At each epoch, SES observes the evolution of hidden-state trajectories from the validation set rather than only the validation loss. The symbolic score is treated as the primary representation-aware convergence signal, while the validation loss is retained as a conservative guard; training is therefore stopped only when symbolic stabilization is present and recent validation improvement is no longer material.

The intuition behind SES is straightforward. Early in training, hidden-state trajectories are often dispersed, irregular, and structurally unstable. As training progresses, the model learns a smaller number of more persistent latent regimes; transitions between them become more regular, and symbolic strings derived from these trajectories become easier to compress and characterize.

SES therefore tracks structural stabilization rather than relying exclusively on the absolute value of a topological invariant. Mapper is used as a coarse but robust descriptor of the validation hidden-state cloud, while symbolic-dynamics metrics quantify how validation trajectories move across the induced regions of phase space.

Compared to simple clustering, Mapper provides a more stable structural scaffold near the useful-learning saturation zone, since it captures overlapping local organization and coarse connectivity rather than only centroid positions. This makes Mapper a natural basis for validation symbolization within each monitored epoch and for comparing neighboring epochs through the resulting descriptor history.

The residual instability of the symbolic basis in early epochs is handled through exponential smoothing and by excluding nearly flat metrics from the rank aggregator. As a result, SES is conceived as a practical convergence proxy for internal representations rather than as a brittle one-metric trigger.

3.2. Method Description

Formally, at epoch e, we consider the tensor of validation hidden states produced by the model:

H_{e} \in \mathbb{R}^{B \times L \times D},

(1)

where B is the number of batches, L is the time window length, and D is the hidden-state dimension (or the embedding dimension of the selected layer in the Transformer). The array H_e is treated as a cloud of embeddings describing the internal phase space of the model at the current epoch.

A Mapper graph is then constructed on this cloud. A filter function (the first principal component, local density, or another scalar functional) is selected; its values are covered by overlapping intervals (bins and overlap); clustering is performed within each interval (local_k); and nearby components are optionally merged using a threshold ε [16,18,20,29,30]. The resulting nodes are interpreted as coarse regions of latent phase space, with centroids defined in the original embedding space.

In our implementation, Mapper is built directly on the validation hidden-state cloud H_e at each epoch. As a lens function, we use the first principal component of the centered embeddings, computed by truncated SVD. This gives a deterministic one-dimensional projection and avoids adding a density-estimation step.

The lens range is then split into overlapping intervals. The main cover parameters are the number of bins and the overlap ratio. Inside each interval, we cluster the original hidden-state embeddings using single-linkage agglomerative clustering, with local_k controlling the local resolution. After this step, nearby clusters are merged if their centroids are closer than merge_eps. Each resulting Mapper node is represented by the centroid of its assigned embeddings, and edges are added between nodes that share samples through overlapping cover intervals.
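The lens and cover steps described above can be sketched as follows. This is a minimal illustration that assumes NumPy; the function names `pca_lens` and `overlapping_cover`, the random stand-in data, and the convention of widening each bin symmetrically by half of the overlap fraction are assumptions of this sketch, not details confirmed by the implementation. The clustering and centroid-merging steps are omitted.

```python
import numpy as np

def pca_lens(X):
    """Project rows of X onto the first principal component (via SVD)."""
    Xc = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[0]

def overlapping_cover(lens, bins=8, overlap=0.30):
    """Return one index array per overlapping interval of the lens range."""
    lo, hi = float(lens.min()), float(lens.max())
    width = (hi - lo) / bins
    pad = 0.5 * overlap * width  # assumed convention: widen each bin on both sides
    members = []
    for b in range(bins):
        left = lo + b * width - pad
        right = lo + (b + 1) * width + pad
        members.append(np.where((lens >= left) & (lens <= right))[0])
    return members

rng = np.random.default_rng(0)
# Stand-in for validation hidden states; a (B, L, D) tensor would first be
# reshaped to (B * L, D).
H = rng.normal(size=(200, 16))
lens = pca_lens(H)
cover = overlapping_cover(lens, bins=8, overlap=0.30)
```

Samples that fall in the padded region between two neighboring intervals appear in both index arrays, and these shared samples are what later give rise to edges between Mapper nodes.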

These parameters determine how fine or coarse the symbolic partition will be. Increasing bins or local_k gives a more detailed partition and a larger symbolic alphabet, but it can also make the symbolic strings more sensitive to small local changes. The overlap parameter controls how smoothly neighboring regions are connected: too little overlap fragments the graph, while too much overlap may merge distinct regimes. The merge_eps parameter controls the stability of Mapper nodes across epochs: small values may leave unstable singleton nodes, whereas large values may collapse several regimes into one.

In the main experiments, we use bins = 8, overlap = 0.30, local_k = 10, and merge_eps = 0.50. This configuration was selected on the ETTh1 development benchmark as the best trade-off between stopping quality and epoch savings. The alternative Mapper configurations and their transfer behavior are reported in Section 4.5.

Symbolization maps hidden-state trajectories into strings over a finite alphabet: each state is assigned to the closest Mapper-node centroid. Each validation sequence is thus converted into a symbolic string over an alphabet of size K. From these strings, we compute a panel of descriptors intended to jointly capture compressibility/complexity, memory structure, and attractor geometry [21,22,31].

− Lempel–Ziv complexity (LZ) as a measure of incremental string compressibility;
− Markov entropy rate $h_{M}$ over the transition matrix and the stationary distribution;
− permutation entropy (PermEn) over ordinal patterns;
− correlation dimension $D_{2}$ over the correlation sum;
− optionally, the fractal (box-counting) dimension $D_{F}$ of the set of visited states.
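The symbolization step that feeds these descriptors (nearest-centroid assignment) can be illustrated with a short sketch. The centroids and trajectory below are made-up two-dimensional values, and `symbolize` is a hypothetical helper name; only the nearest-centroid rule itself comes from the text.

```python
import math

def symbolize(trajectory, centroids):
    """Map each hidden state to the index of its nearest centroid."""
    symbols = []
    for state in trajectory:
        dists = [math.dist(state, c) for c in centroids]
        symbols.append(dists.index(min(dists)))
    return symbols

# Hypothetical Mapper-node centroids (alphabet size K = 3).
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
trajectory = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.1, 0.9), (0.0, 0.2)]
symbols = symbolize(trajectory, centroids)
```

Here the trajectory maps to the string 0, 1, 1, 2, 0, and each descriptor in the panel is then computed on strings of this kind.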

For each metric, values are first averaged across validation trajectories, then smoothed by an exponential moving average (EMA) with parameter α, and finally transformed into rank-based normalized quantities over the observed epoch history.

{\tilde{m}}_{e} = α m_{e} + (1 - α) {\tilde{m}}_{e - 1},

(2)

The empirical rank transform used by SES is defined over the metric history observed up to epoch e:

\{\tilde{m}_{i}\}_{i=1}^{e} \to r(m_{e}) = \frac{\sum_{i=1}^{e-1} [\tilde{m}_{i} \leq \tilde{m}_{e}]}{e},

(3)

s_{\mathrm{SES},e} = \operatorname{median}(\{r(m_{e}) \mid m \in \{LZ, h_{M}, \mathrm{PermEn}, D_{2}, D_{F}\}\}),

(4)
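Equations (2)–(4) can be traced on a small example. The sketch below follows the formulas as written; initializing the EMA with the first observed value and the descriptor histories themselves are assumptions made for illustration.

```python
from statistics import median

def ema(values, alpha):
    """Exponential moving average, Eq. (2); the first value seeds the filter."""
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out

def rank_at(smoothed, e):
    """Empirical rank of epoch e (1-based) within its own history, Eq. (3)."""
    current = smoothed[e - 1]
    count = sum(1 for v in smoothed[: e - 1] if v <= current)
    return count / e

# Hypothetical per-epoch descriptor averages over four epochs.
history = {
    "LZ": [0.90, 0.80, 0.70, 0.65],      # steadily decreasing
    "h_M": [1.20, 1.00, 0.90, 1.10],     # rebounds at the last epoch
    "PermEn": [0.80, 0.85, 0.90, 0.95],  # steadily increasing
}
alpha, e = 0.5, 4
ranks = {name: rank_at(ema(vals, alpha), e) for name, vals in history.items()}
score = median(ranks.values())  # Eq. (4)
```

A descriptor that has just reached its lowest smoothed value receives rank 0, while one at its historical maximum receives rank (e − 1)/e, so the median stays low only when most descriptors are at or near their best values.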

Next, the local range of each metric is estimated over a window of length W. A metric is included in the aggregator only if its absolute and relative variation exceed prespecified liveness thresholds; weakly varying indicators are treated as noise and excluded from the rank aggregation.

The aggregated symbolic score at epoch e is defined as the median rank over all active metrics:

s_{\mathrm{SES},\mathrm{aggregated}} = \operatorname{median}(\{r(m_{e}) \mid m \in \{LZ, h_{M}, \mathrm{PermEn}, D_{2}, D_{F}\} \land m \text{ alive}\}),

(5)
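A possible reading of the liveness filter and of Equation (5) is sketched below. The text does not give the exact definitions of the absolute and relative variation, so the range-based definitions, the threshold values, and the names `is_alive` and `aggregated_score` are assumptions of this sketch; the rank values are likewise illustrative.

```python
from statistics import median

def is_alive(smoothed, window, abs_tol, rel_tol):
    """A metric is 'alive' if its recent range clears both thresholds."""
    recent = smoothed[-window:]
    spread = max(recent) - min(recent)
    scale = max(abs(sum(recent) / len(recent)), 1e-12)
    return spread > abs_tol and spread / scale > rel_tol

def aggregated_score(ranks, smoothed, window=5, abs_tol=1e-3, rel_tol=1e-2):
    """Median rank over alive metrics, Eq. (5); None if no metric is alive."""
    alive = [ranks[m] for m in ranks
             if is_alive(smoothed[m], window, abs_tol, rel_tol)]
    return median(alive) if alive else None

# Hypothetical smoothed histories; D_2 is nearly flat and should be excluded.
smoothed = {
    "LZ": [0.90, 0.85, 0.78, 0.71, 0.69],
    "h_M": [1.20, 1.10, 1.00, 1.05, 1.02],
    "D_2": [2.0000, 2.0001, 2.0000, 2.0001, 2.0000],
}
ranks = {"LZ": 0.0, "h_M": 0.2, "D_2": 0.6}  # illustrative rank values
score = aggregated_score(ranks, smoothed)
```

With these inputs only LZ and h_M pass the filter, so the flat D_2 history cannot pull the aggregated score upward.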

Lower values of the final symbolic score indicate more ordered and stable internal dynamics. The stopping decision by SES is hybrid: symbolic-score stabilization is the primary stopping signal, while validation loss is retained as a conservative guard against stopping during materially improving validation performance.

The choice of the median over the rank panel as the aggregation operator is motivated by three considerations. First, the symbolic-dynamics descriptors LZ, h_M, PermEn, D_2, and (optionally) D_F have different scales and noise profiles, so any aggregator operating on raw values tends to be dominated by the descriptor with the largest variance. The empirical-rank transform of Equation (3) maps every descriptor onto the same uniform-on-[0, 1] scale and therefore makes the rank aggregator scale-invariant. Second, the median of K i.i.d. ranks has bounded influence with a breakdown point of 1/2, so a single descriptor that becomes uninformative or pathological in a given epoch (for example, the box-counting estimator on a near-degenerate visited-state set) cannot dominate the score; in contrast, the arithmetic mean has a breakdown point of 0 and is sensitive to single outliers. Third, the median commutes with monotone descriptor transformations and is therefore consistent with the rank normalization itself. The top-q family provides a controlled way to make the rank aggregation more aggressive: smaller q values emphasize the descriptors that stabilize earliest and therefore tend to trigger earlier stopping. The median-rank rule corresponds to the robust central tendency of the active descriptor panel, whereas top-q variants are treated as more aggressive alternatives rather than as identical estimators. A learned weighted sum was deliberately not adopted in the main pipeline because tuning its weights on the same data on which the stopping decision is made would introduce circular dependence between SES and the validation curve it is supposed to monitor.

An offline empirical comparison of arithmetic-mean, median-rank, top-q, and inverse-variance weighted rank-sum variants is reported in Appendix G (Tables A4 and A5) on the ETTh1 development benchmark across the three architectures used in this study. It confirms that median-rank provides a stable compromise between earliness and regret on the recurrent architectures. It also shows that, with the K = 4 active descriptors of the default panel, top-q with q ∈ {0.3, 0.5} reduces to the same statistic (the mean of the two smallest ranks) and therefore does not provide a distinct operating point in this configuration.
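The coincidence of the two top-q settings for K = 4 can be checked with a few lines of arithmetic. The sketch assumes that top-q averages the ⌈q·K⌉ smallest ranks; this convention is not stated explicitly in the text and is chosen here because it reproduces the reduction to the mean of the two smallest ranks. The rank values are made up.

```python
import math

def top_q_mean(ranks, q):
    """Mean of the ceil(q * K) smallest ranks (assumed top-q convention)."""
    k = max(1, math.ceil(q * len(ranks)))
    return sum(sorted(ranks)[:k]) / k

ranks = [0.10, 0.40, 0.30, 0.80]  # hypothetical ranks of K = 4 active descriptors
a = top_q_mean(ranks, 0.3)        # ceil(1.2) = 2 ranks averaged
b = top_q_mean(ranks, 0.5)        # ceil(2.0) = 2 ranks averaged
```

Both calls average the ranks 0.10 and 0.30, so the two settings give the same score; they would separate only for a larger panel or a different rounding rule.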

Throughout this paper, the symbolic score at epoch e is denoted by s_{SES,e} (or, equivalently, S_e in figures and pseudocode); the rank-aggregated form used in the hybrid rule is denoted by s_{SES,aggregated} and is defined by Equation (5). Hereinafter, S_e is used as the shorthand for s_{SES,e}.

We use the following two criteria for a stopping decision using SES.

Criterion (1) (symbolic stall).

We monitor an aggregate symbolic score defined as a rank aggregate (rank-median or top-q) over the smoothed active descriptors. Three control parameters are used:

− min_epoch: the first epoch at which symbolic stopping becomes admissible;
− patience P: the maximum number of consecutive epochs without significant improvement;
− minimum improvement threshold δ_sym: separates meaningful decreases in S_e from random fluctuations.

At each epoch, the current symbolic score is compared with the best symbolic score previously observed. If the decrease exceeds δ_sym, the current epoch becomes the new symbolic best, and the no-improve counter is reset; otherwise, the counter is incremented. This produces a symbolic-stall indicator rather than a standalone stopping decision, since the final stop still depends on the validation guard.

e \geq \mathrm{min\_epoch}, \quad \mathrm{no\_improve} \geq P,

(6)

When this condition is met, the symbolic score is treated as stalled. This is close to classic patience-based early stopping [1,2,32], but applied to a symbolic scalar score that reflects the stabilization of the model’s internal dynamics rather than directly to the validation loss.
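The counter logic of Criterion (1) can be written compactly. The sketch below is one plausible reading of the update rule and of Equation (6); the score sequence and parameter values are illustrative, and `symbolic_stall` is a hypothetical function name.

```python
def symbolic_stall(scores, min_epoch, patience, delta_sym):
    """Return the first 1-based epoch at which Eq. (6) holds, or None."""
    best = float("inf")
    no_improve = 0
    for e, s in enumerate(scores, start=1):
        if best - s > delta_sym:
            # Meaningful decrease: new symbolic best, reset the counter.
            best, no_improve = s, 0
        else:
            no_improve += 1
        if e >= min_epoch and no_improve >= patience:
            return e
    return None

scores = [0.90, 0.70, 0.50, 0.49, 0.50, 0.495, 0.49]  # illustrative S_e values
stall_epoch = symbolic_stall(scores, min_epoch=3, patience=3, delta_sym=0.05)
```

In this example the last meaningful decrease occurs at epoch 3, and the following three epochs change the score by less than the threshold, so the stall indicator fires at epoch 6.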

Criterion (2) (symbolic plateau/slope) estimates the local trend of the symbolic score over a sliding window. On the last W_slope points, we fit a simple linear regression and interpret its coefficient as the local slope of the curve.

S_{t} \approx a t + b,

(7)

A plateau stop is allowed when the absolute slope falls below a small threshold ε_slope, indicating that the symbolic score has effectively stabilized on the selected interval:

e \geq \mathrm{min\_epoch}, \quad |a| < ε_{\mathrm{slope}}.

(8)

This approach is consistent with more general stopping rules that use a smoothed error or residual trend as an indicator of the end of useful learning in iterative procedures and regularization [3,33]. In our case, the trend is estimated from S_e by linear regression on a sliding window.
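The sliding-window slope test of Criterion (2) amounts to an ordinary least-squares fit. The following sketch uses only the standard library; the window length, threshold, and score values are illustrative assumptions, and the function names are hypothetical.

```python
def window_slope(scores, window):
    """Least-squares slope a of S_t ≈ a t + b over the last `window` points."""
    ys = scores[-window:]
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    den = sum((t - t_mean) ** 2 for t in ts)
    return num / den

def plateau_reached(scores, epoch, min_epoch, window, eps_slope):
    """Eq. (8): the epoch is admissible and |a| is below the threshold."""
    if epoch < min_epoch or len(scores) < window:
        return False
    return abs(window_slope(scores, window)) < eps_slope

scores = [0.90, 0.70, 0.50, 0.49, 0.50, 0.495, 0.49]  # illustrative S_e values
flat = plateau_reached(scores, epoch=7, min_epoch=3, window=4, eps_slope=0.01)
steep = plateau_reached(scores[:4], epoch=4, min_epoch=3, window=4, eps_slope=0.01)
```

The window must contain at least two points for the fit to be defined. On the last four points the fitted slope is about −0.0005, so the plateau condition holds, whereas the first four points still decrease steeply and do not satisfy it.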

Combined rule. Training is stopped only when the symbolic signal has stabilized according to the symbolic-stall and/or symbolic-plateau logic, and the validation guard does not indicate that the validation loss is still improving materially. In this sense, SES is explicitly a hybrid stopping rule: the symbolic score provides the representation-aware monitoring signal, whereas the validation loss serves as a conservative guard against transient or overly aggressive stops.

e_{\mathrm{stop}} = \min \{e : (\exists t_{1} \leq e : \mathrm{Crit}_{1}(t_{1}) = 1) \land (\exists t_{2} \leq e : \mathrm{Crit}_{2}(t_{2}) = 1)\},

(9)
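A sketch of the combined decision is given below. It implements the conjunction of Equation (9) literally and adds a validation guard whose exact form is not specified in the text: here the guard compares the best loss in a recent window with the best loss before it, and both `guard_window` and `delta_val` are hypothetical parameters introduced for illustration.

```python
def combined_stop(crit1, crit2, val_loss, guard_window=3, delta_val=1e-3):
    """First 1-based epoch satisfying Eq. (9) and an assumed validation guard."""
    fired1 = fired2 = False
    for e in range(1, len(val_loss) + 1):
        fired1 = fired1 or crit1[e - 1]
        fired2 = fired2 or crit2[e - 1]
        if not (fired1 and fired2) or e <= guard_window:
            continue
        # Guard (assumed form): recent best loss versus best loss before the window.
        prior_best = min(val_loss[: e - guard_window])
        recent_best = min(val_loss[e - guard_window : e])
        if prior_best - recent_best < delta_val:
            return e
    return None

# Illustrative per-epoch indicator values and validation losses.
crit1 = [False, False, False, False, True, False, False, False]
crit2 = [False, False, False, True, False, False, False, False]
val_loss = [1.00, 0.80, 0.60, 0.50, 0.45, 0.4498, 0.4497, 0.4499]
e_stop = combined_stop(crit1, crit2, val_loss)
```

Both criteria have fired by epoch 5, but the validation loss is still improving materially at that point, so the guard postpones the stop until epoch 8.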

Algorithm A1 in Appendix A specifies the full SES procedure as a pseudocode listing, including the inputs (model; training and validation loaders; Mapper hyperparameters; descriptor-panel parameters; smoothing constant α; rank-aggregator type; patience and slope-window thresholds; and the validation-loss guard), the maintained internal state (best-score and best-validation epoch counters, no-improve counter, score history S, and validation history), and the per-epoch update rules. The listing makes the order of operations explicit: hidden-state extraction → Mapper construction → symbolization → descriptor computation → smoothing and rank aggregation → symbolic-stall and symbolic-plateau checks → validation guard → combined stop. The implementation described in this study (applicable to RNN, BiRNN, and Transformer variants) is available in the project repository.

3.3. Metrics of Symbolic Dynamics

(a) Lempel–Ziv complexity (LZ). For each symbolic string s_{1:L}, we compute the number of phrases c(s_{1:L}) using the classical LZ76 algorithm [34,35] and use the normalized estimate

LZ(s_{1:L}) = \frac{c(s_{1:L}) \log_{K} L}{L},

(10)

where K is the alphabet size. The smaller the LZ, the more regular and predictable the symbolic trajectory. In the context of NSM, a decrease in LZ across epochs corresponds to a decrease in the “novelty” of internal trajectories and a transition to learned patterns [36,37,38].
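For concreteness, the phrase count c and the normalization of Equation (10) can be computed as follows. The parser is the widely used Kaspar–Schuster formulation of LZ76, written here as a standalone sketch rather than taken from the authors' code; the test string is the standard textbook example.

```python
import math

def lz76_phrases(s):
    """Number of phrases in the LZ76 parsing of sequence s (Kaspar-Schuster)."""
    n = len(s)
    if n < 2:
        return n
    i, k, l = 0, 1, 1       # i: search start, k: match length, l: phrase start
    c, k_max = 1, 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
            if l + k > n:   # the match runs to the end of the string
                c += 1
                break
        else:
            k_max = max(k_max, k)
            i += 1
            if i == l:      # all earlier start positions tried: close the phrase
                c += 1
                l += k_max
                if l + 1 > n:
                    break
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c

def lz_normalized(s, K):
    """Normalized LZ estimate of Eq. (10) for alphabet size K >= 2."""
    L = len(s)
    return lz76_phrases(s) * math.log(L, K) / L

c = lz76_phrases("0001101001000101")   # parses as 0|001|10|100|1000|101
lz = lz_normalized("0001101001000101", K=2)
```

The example string splits into six phrases, which gives a normalized value of 6 · log₂(16) / 16 = 1.5; a constant string of the same length would yield only two phrases.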

(b) Markov entropy rate h_M. Symbolic trajectories are treated as samples from a finite Markov chain with transition matrix P = (p_ij) and stationary distribution π. The entropy rate is given by the standard formula

h_{M} = - \sum_{i=1}^{K} π_{i} \sum_{j=1}^{K} p_{ij} \log p_{ij},

(11)

Smaller values of h_M indicate that the trajectory “sticks” to a limited set of transitions typical of stable regimes, while high values correspond to more chaotic dynamics with a rich repertoire of transitions [21,22].
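A plug-in estimate of Equation (11) from a single symbolic string might look like the sketch below. It uses the empirical occupancy of transition source states as a stand-in for the stationary distribution π, which is an assumption of this sketch; an eigenvector-based estimate of π could be used instead. Entropy is reported in nats.

```python
import math
from collections import Counter

def markov_entropy_rate(symbols):
    """Plug-in estimate of Eq. (11) in nats from one symbolic string."""
    pair_counts = Counter(zip(symbols[:-1], symbols[1:]))
    source_counts = Counter(symbols[:-1])
    total = len(symbols) - 1
    h = 0.0
    for (i, _j), n_ij in pair_counts.items():
        pi_i = source_counts[i] / total   # empirical occupancy as a stand-in for pi_i
        p_ij = n_ij / source_counts[i]    # estimated transition probability
        h -= pi_i * p_ij * math.log(p_ij)
    return h

periodic = markov_entropy_rate("ABABABABAB")     # deterministic transitions
mixed = markov_entropy_rate("AABB" * 5 + "A")    # each state stays or leaves equally often
```

The strictly alternating string has zero entropy rate, whereas the second string, in which every state is followed by either symbol with equal frequency, reaches log 2 nats.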

(c) Permutation entropy (PermEn) is constructed from the distribution of ordinal patterns of length m taken with step τ. For each ordinal pattern π (here π indexes ordinal patterns, not the stationary distribution of Equation (11)), the relative frequency p_π is estimated; the normalized entropy

\mathrm{PermEn} = - \frac{1}{\log(m!)} \sum_{π} p_{π} \log p_{π},

(12)

takes values from 0 to 1. Values close to 1 correspond to dynamics close to “white” noise, while a decrease in PermEn indicates the emergence of stable deterministic orders in the sequence [22,39].
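Equation (12) can be evaluated with a short routine. In the sketch below, ties inside a window are broken by time order, which is an assumption that matters for symbolic strings with repeated symbols; the pattern length m = 3 and step τ = 1 are illustrative defaults.

```python
import math
from collections import Counter

def permutation_entropy(series, m=3, tau=1):
    """Normalized permutation entropy, Eq. (12); ties keep their time order."""
    n_windows = len(series) - (m - 1) * tau
    patterns = Counter()
    for t in range(n_windows):
        window = [series[t + k * tau] for k in range(m)]
        # Ordinal pattern = argsort of the window (stable sort breaks ties by position).
        patterns[tuple(sorted(range(m), key=window.__getitem__))] += 1
    probs = [count / n_windows for count in patterns.values()]
    return -sum(p * math.log(p) for p in probs) / math.log(math.factorial(m))

monotone = permutation_entropy(list(range(20)))
zigzag = permutation_entropy([0, 2, 1, 3, 2, 4, 3, 5, 4, 6])
```

A monotone series produces a single ordinal pattern and therefore zero entropy, while the zigzag series alternates between two patterns and gives log 2 / log 6 ≈ 0.39.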

(d) Correlation dimension D₂. To estimate the geometric complexity of the attractor, we construct delay embeddings of the symbolic or embedded trajectory, x_t = (s_t, …, s_{t+m}), and compute the correlation sum

C(r) = \frac{2}{N(N-1)} \sum_{i<j} \mathbf{1}(\lVert x_{i} - x_{j} \rVert < r),

(13)

We then estimate D₂ as the slope of the log C(r) versus log r curve over the selected scaling interval [r_min, r_max]. The resulting value is interpreted as the correlation dimension. A decrease in D₂ over epochs corresponds to a compaction of the visited hidden-state set and to the emergence of a lower-dimensional latent regime [21,31].
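The estimate can be reproduced on a toy point set. The sketch below computes the correlation sum by brute force and fits the log–log slope by least squares; the radii are chosen by hand for this example, and every radius must be large enough that C(r) > 0. Points on a line segment are used because their correlation dimension should be close to 1.

```python
import math

def correlation_sum(points, r):
    """C(r) of Eq. (13): fraction of distinct pairs closer than r."""
    n = len(points)
    close = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(points[i], points[j]) < r
    )
    return 2.0 * close / (n * (n - 1))

def correlation_dimension(points, radii):
    """Least-squares slope of log C(r) against log r over the given radii."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(correlation_sum(points, r)) for r in radii]
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

line = [(i / 200.0,) for i in range(200)]   # evenly spaced points on a segment
d2 = correlation_dimension(line, radii=[0.0125, 0.0275, 0.0525, 0.1025])
```

The brute-force pair loop is quadratic in the number of points, so in practice the estimate would typically be computed on a subsample of the embedded trajectory.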

(e) Optional box-counting (fractal) dimension. To provide an additional geometric descriptor, we optionally estimate the box-counting dimension D_F of the visited-state set through the standard scaling relation N(ε) ∝ ε^{−D_F}, where N(ε) is the minimum number of cells of size ε needed to cover the set:

D_{F} = \lim_{ε \to 0} \frac{\log N (ε)}{\log (1 / ε)},

(14)

D_F is estimated as the slope of the approximately linear region in the plot of log N(ε) versus log(1/ε). It complements D₂ by probing scale invariance rather than pairwise correlation structure.
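A minimal sketch of the slope estimate behind Equation (14) is given below. The function name `box_counting_dimension` is hypothetical, and the occupied cells of a fixed axis-aligned grid are used as a stand-in for the minimal cover N(ε); this is a common practical approximation, not necessarily the procedure used in the experiments.

```python
import numpy as np

def box_counting_dimension(X, epsilons):
    """Slope of log N(eps) versus log(1 / eps) using occupied grid cells."""
    X = np.asarray(X, dtype=float)
    X = X - X.min(axis=0)                       # shift so the grid starts at 0
    epsilons = np.asarray(epsilons, dtype=float)
    counts = []
    for eps in epsilons:
        cells = np.floor(X / eps).astype(np.int64)
        counts.append(len(np.unique(cells, axis=0)))   # occupied cells
    slope, _ = np.polyfit(np.log(1.0 / epsilons), np.log(counts), 1)
    return float(slope)
```

With cell sizes 1/4, 1/8, 1/16, and 1/32, points on a diagonal segment give a slope of 1 and a densely sampled unit square gives a slope close to 2. The estimate is only meaningful over the range of ε where the log-log plot is approximately linear.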

All metrics are first averaged across sequences and batches, then normalized, smoothed with EMA, and filtered by liveness. The rank-median or top-q aggregator generates a single symbolic score, which is used by SES to make early-stopping decisions.
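The smoothing and aggregation step can be illustrated with a short sketch. The EMA recursion is standard; everything else is an assumption made for illustration: the liveness filter is approximated by dropping descriptors whose smoothed history is numerically constant, the normalization is replaced by a within-history fractional rank, and the function names `ema` and `rank_median_score` as well as the tolerance `live_tol` are hypothetical.

```python
import numpy as np

def ema(history, alpha=0.3):
    """Exponential moving average along axis 0 (epochs)."""
    h = np.asarray(history, dtype=float)
    out = np.empty_like(h)
    out[0] = h[0]
    for t in range(1, len(h)):
        out[t] = alpha * h[t] + (1.0 - alpha) * out[t - 1]
    return out

def rank_median_score(history, alpha=0.3, live_tol=1e-8):
    """history: (epochs, descriptors) array. Returns a score in [0, 1]
    for the latest epoch; lower means the descriptors sit near their minima."""
    sm = ema(history, alpha)
    live = sm.std(axis=0) > live_tol       # assumed liveness filter
    if not live.any():
        return 0.0
    sm = sm[:, live]
    # Fractional rank of the current value within each descriptor's history.
    ranks = (sm <= sm[-1]).mean(axis=0)
    return float(np.median(ranks))
```

Under these assumptions, a panel whose descriptors all decrease monotonically yields the smallest possible score at the latest epoch (1 divided by the number of epochs), and the median makes the score insensitive to a single descriptor that behaves erratically.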

3.4. Datasets and Metrics

To evaluate SES, we used standard time-series benchmarks together with synthetic and real datasets representing different dynamical regimes. For convenience of interpretation, we grouped the datasets into quasi-periodic, intermediate, regime-switching, and near-chaotic cases.

The quasi-periodic group included the ETT (Electricity Transformer Temperature) family [40] and the monthly AirPassengers series [41]. The intermediate and regime-switching group included Bitcoin price data [42] and the DEAP EEG benchmark [43]. The near-chaotic case was represented by a synthetic Lorenz-system dataset constructed from the governing dynamics [44].

We compared early-stopping methods using a standard panel of model-selection metrics.

−: The epoch number

e_{stop},

(15)

at which the training was stopped, and the validation loss

\mathrm{val\_loss}(e_{stop}),

(16)

at the stopping point. Metrics were recorded for each ES method and for each model–dataset–seed triplet.

−: “Oracle epoch” e*, defined as the epoch of the global minimum of the validation loss over the entire learning horizon under consideration, for example, in the first E_max epochs,

e^{*} = \arg\min_{1 \leq e \leq E_{max}} \mathrm{val\_loss}(e),

(17)

and regret

ΔBest,

ΔBest = \mathrm{val\_loss}(e_{stop}) - \mathrm{val\_loss}(e^{*}) \geq 0,

(18)

The smaller ΔBest, the closer the ES method is to ideal offline epoch selection.

−: Savings in training epochs,

\mathrm{epochs\_saved} = e_{max} - e_{stop},

(19)

Here, e_max denotes a common late-training reference, taken either as the maximum stopping epoch among the compared strategies for a given model–dataset pair, or as the fixed training horizon.

−: The proportion of runs in which the compared ES method is “close enough” to the oracle,

\text{within-}ε\text{-Oracle} = \frac{1}{N} \sum_{i = 1}^{N} 1 (ΔBest_{i} \leq ε),

(20)

where ε > 0 is a fixed threshold, 1(⋅) is an indicator function that takes the value 1 if its argument holds and 0 otherwise, and N is the number of runs (seeds).
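The metric panel of Equations (15)–(20) reduces to a few array operations on a logged validation curve. The sketch below assumes that `val_loss[e - 1]` stores the validation loss of epoch e (1-based epochs, as in the text); the function and key names are illustrative.

```python
import numpy as np

def stopping_metrics(val_loss, e_stop, e_max):
    """Per-run metrics; epochs are 1-based, val_loss[e - 1] is epoch e."""
    val_loss = np.asarray(val_loss, dtype=float)
    e_star = int(np.argmin(val_loss)) + 1                  # oracle epoch, Eq. (17)
    val_at_stop = float(val_loss[e_stop - 1])              # Eq. (16)
    delta_best = val_at_stop - float(val_loss[e_star - 1]) # Eq. (18)
    return {
        "e_stop": e_stop,
        "val_at_stop": val_at_stop,
        "e_star": e_star,
        "delta_best": delta_best,
        "epochs_saved": e_max - e_stop,                    # Eq. (19)
    }

def within_eps_oracle(delta_bests, eps):
    """Fraction of runs whose regret is at most eps, Eq. (20)."""
    return float(np.mean(np.asarray(delta_bests, dtype=float) <= eps))
```

For example, with the curve [1.0, 0.5, 0.4, 0.45, 0.3, 0.35], a stop at epoch 3, and e_max = 6, the oracle epoch is 5, the regret is about 0.1, and three epochs are saved.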

In addition, we used the coefficient of determination R² [45] to evaluate how well the early symbolic features predict the aggregated final-quality indicators. For this purpose, we fitted a linear model from early-epoch features to final performance across seeds:

ŷ_{i} (E) = β_{0} + β^{T} φ_{i} (E),

(21)

where i indexes the run, φ_{i}(E) denotes the vector of early-epoch features computed up to epoch E, and β_{0} and β are regression parameters estimated across the runs. The coefficient of determination is then defined as

R^{2} (E) = 1 - \frac{\sum_{i = 1}^{N} (y_{i}^{*} - {\hat{y}}_{i} (E))^{2}}{\sum_{i = 1}^{N} (y_{i}^{*} - {\bar{y}}^{*})^{2}},

(22)

where y_{i}^{*} is the actual final quality of the i-th run, i = 1, …, N; \hat{y}_{i}(E) is the quality predicted from early features computed up to epoch E; and \bar{y}^{*} = \frac{1}{N} \sum_{i = 1}^{N} y_{i}^{*} is the average final quality across all runs.

Regression parameters are estimated by ordinary least squares (OLS). Let y be the target vector, X the design matrix augmented with an intercept column, and θ = (β_{0}, β^{T})^{T}. The normal equations are

X^{⊤} X \hat{θ} = X^{⊤} y,

(23)

and, when X^{⊤} X is invertible, the closed-form estimator is

\hat{θ}_{OLS} = (X^{⊤} X)^{- 1} X^{⊤} y,

(24)
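Equations (21)–(24) can be checked numerically with a short sketch. The function name `ols_r2` is hypothetical. Instead of inverting X^⊤X explicitly, the example uses NumPy's least-squares solver, which returns the same estimator when X^⊤X is invertible and a minimum-norm solution otherwise; the reported R² is computed in-sample, which tends to be optimistic when the number of runs is small.

```python
import numpy as np

def ols_r2(Phi, y):
    """Fit y ~ beta_0 + beta^T phi by OLS; return (theta_hat, in-sample R^2)."""
    y = np.asarray(y, dtype=float)
    # Design matrix augmented with an intercept column, as in the text.
    X = np.column_stack([np.ones(len(y)), np.asarray(Phi, dtype=float)])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ theta
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return theta, float(1.0 - ss_res / ss_tot)
```

On noise-free data generated as y = 2 + 3φ the fit recovers θ ≈ (2, 3) with R² ≈ 1, which is a convenient sanity check before applying the routine to early-epoch features.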

To quantify variability across runs, we report the interquartile range (IQR), defined as

\mathrm{IQR} (x) = Q_{3} (x) - Q_{1} (x),

(25)

IQR measures the spread of the central 50% of observations and is substantially less sensitive to outliers than the full range.
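The robustness of the IQR to outliers is easy to see on a toy example. The sketch assumes NumPy's default linear-interpolation quantile definition; other quantile conventions give slightly different values on small samples.

```python
import numpy as np

def iqr(x):
    """Interquartile range Q3 - Q1 (NumPy default quantile interpolation)."""
    q1, q3 = np.percentile(np.asarray(x, dtype=float), [25, 75])
    return float(q3 - q1)

stops_clean = [1, 2, 3, 4, 5]
stops_outlier = [1, 2, 3, 4, 100]   # one extreme stopping epoch
```

Both lists have an IQR of 2, although the full range grows from 4 to 99 once the outlier is included.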

3.5. Experimental Techniques

We conducted six main groups of experiments, referred to as E1–E6. The core SES mechanism consisted of Mapper-based symbolization of validation hidden states, descriptor aggregation, and the hybrid stopping rule described above. The primary evaluation metrics were the stopping epoch e_stop, the validation loss at stop, the oracle regret ΔBest, epochs_saved, and the within-ε-oracle frequency. Additional statistics such as R², IQR, and spread-based summaries were used only as secondary analysis tools and were not part of the stopping decision itself. Within this framework, E1 studied the variability of stopping times induced by individual symbolic-dynamics descriptors and by their ensembles. E2 compared SES with strong baseline stopping rules on the benchmark suite. E3 evaluated robustness under additive Gaussian noise. E4 examined the dependence of SES on network depth and on the monitored layer or block. E5 examined the transferability of the selected Mapper hyperparameters across architectures without any per-task retuning. E6 profiled the computational overhead of representation-aware monitoring and the resulting wall-clock trade-off.

The baselines span the loss-based and representation/correlation-based criteria: Patience [2], Slope-based stopping [32], SVCCA-style representation stabilization [25], and the Correlation-Driven Stopping Criterion (CDSC) [46]. For all baseline rules, hyperparameters are fixed before the final benchmark runs and are reported once for the shared protocol instead of being re-tuned separately for every individual series.

To enable independent replication of the experimental protocol, we summarize here the shared training-and-evaluation settings used in the E1–E6 groups. All models are trained for a fixed maximum horizon E_max = 100 epochs with the Adam optimizer (learning rate 10⁻³, batch size 64) in a sweep over N = 10 random seeds per (model, dataset, noise) cell. The recurrent baselines are single-layer LSTMs with hidden size 64; the bidirectional variant uses the same configuration with concatenated forward/backward states; the encoder-only Transformer uses 2 encoder blocks, 4 attention heads, model dimension 128, and mean-pooling over the sequence axis, and the prediction head is a single linear layer. Validation embeddings H_e used for SES are recomputed from a held-out validation split at the end of every epoch. The baseline stopping rules use the patience window p = 5 epochs, slope window W_slope = 5 epochs, slope threshold ε_slope = 10⁻³, CDSC correlation threshold 0.95 with patience 5, and SVCCA cosine-similarity threshold 0.99 with patience 5; these settings were fixed once on the development split (ETTh1, RNN, σ = 0) and reused across all benchmark cells without per-dataset retuning. SES uses min_epoch = 5, δ = 10⁻³, EMA α = 0.3, alphabet size K equal to the number of Mapper nodes per epoch, and the Mapper configuration of Section 3.2. For E3, additive Gaussian noise was injected into the z-normalized input series with σ ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}; σ = 0.0 corresponds to the clean benchmark. The complete configuration files, fixed seeds, training scripts, dataset preprocessing, and result aggregation are released in the project repository together with Figures 3–14 to enable end-to-end reproduction.
All experiments were implemented in Python 3.10 with PyTorch 2.1.0, NumPy 1.24, scikit-learn 1.3, and KeplerMapper 2.0.1.
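To make the loss-based baseline settings concrete, the sketch below applies a patience rule and a slope rule to a logged validation curve with the window sizes and threshold quoted above (p = 5, W_slope = 5, ε_slope = 10⁻³). The exact formulations in [2] and [32] may differ; the least-squares slope over the last window and the `min_delta` argument are assumptions of this example, and the function names are hypothetical.

```python
import numpy as np

def patience_stop(val_loss, p=5, min_delta=0.0):
    """Return the 1-based epoch at which p epochs passed without improvement."""
    best, wait = float("inf"), 0
    for e, v in enumerate(val_loss, start=1):
        if v < best - min_delta:
            best, wait = v, 0
        else:
            wait += 1
            if wait >= p:
                return e
    return len(val_loss)            # never triggered: full horizon

def slope_stop(val_loss, window=5, eps=1e-3):
    """Return the first epoch whose trailing-window slope is below eps."""
    v = np.asarray(val_loss, dtype=float)
    t = np.arange(window)
    for e in range(window, len(v) + 1):
        slope = np.polyfit(t, v[e - window:e], 1)[0]   # least-squares slope
        if abs(slope) < eps:
            return e
    return len(v)
```

On a curve that reaches its minimum at epoch 4 and then drifts upward, the patience rule fires at epoch 9, that is, five epochs after the last improvement.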

A detailed description of the experimental techniques is presented in Appendix B.

4. Results and Discussion

4.1. Variability of Symbolic-Dynamics Indicators (E1 Group of Experiments)

The first group of experiments compares individual symbolic-dynamics descriptors with their rank-aggregated ensemble. Across the benchmark suite, the single metrics generally display the same qualitative trend: they are high and unstable in the early epochs, then progressively decrease and eventually approach a plateau as the hidden-state organization becomes more stable. On the other hand, the exact saturation epoch differs from one metric to another and from one dataset to another, which motivates the use of a rank-aggregated ensemble rather than relying on any single symbolic descriptor.

The ensemble score used in SES is more stable across seeds than any single descriptor taken separately. In particular, the median stopping epochs tend to cluster around the visually identifiable knee of the metric trajectories, while the IQR of the ensemble-based stop is smaller than that for the most volatile individual metrics.

In all Mapper visualizations, hidden states are first projected into a low-dimensional lens (2D or 3D PCA), the lens space is covered by overlapping regions, the samples are clustered within each region, and the clusters are connected when they share samples across overlaps.
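The four steps just described (lens, overlapping cover, local clustering, and linking through shared samples) can be sketched without external dependencies. The experiments themselves rely on KeplerMapper; the code below is a simplified NumPy-only stand-in that uses a one-dimensional PCA lens instead of a 2D or 3D one and threshold-based single-linkage clustering. The function names and the parameters `bins`, `overlap`, and `eps` are illustrative assumptions and do not necessarily match the semantics of the Mapper settings reported in the paper.

```python
import numpy as np

def pca_lens(H, dim=1):
    """Project centred hidden states onto the leading principal directions."""
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Hc @ Vt[:dim].T

def single_linkage(points, eps):
    """Label connected components of the eps-neighbourhood graph."""
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = current
        stack = [i]
        while stack:                                   # depth-first flood fill
            j = stack.pop()
            for k in np.flatnonzero((d[j] <= eps) & (labels < 0)):
                labels[k] = current
                stack.append(k)
        current += 1
    return labels

def mapper_1d(H, bins=4, overlap=0.3, eps=0.5):
    """Return (nodes, edges): nodes are sets of sample indices."""
    H = np.asarray(H, dtype=float)
    lens = pca_lens(H, 1).ravel()
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / bins
    pad = overlap * width                              # interval overlap
    nodes = []
    for b in range(bins):
        a, z = lo + b * width - pad, lo + (b + 1) * width + pad
        idx = np.flatnonzero((lens >= a) & (lens <= z))
        if idx.size == 0:
            continue
        labels = single_linkage(H[idx], eps)           # cluster within region
        for c in range(labels.max() + 1):
            nodes.append(set(idx[labels == c].tolist()))
    # Two clusters are connected when they share at least one sample.
    edges = {(i, j) for i in range(len(nodes))
             for j in range(i + 1, len(nodes)) if nodes[i] & nodes[j]}
    return nodes, edges
```

On two well-separated point clouds this yields two disconnected nodes, while evenly spaced points along a line yield a path graph, which mirrors the compact versus multi-region topologies discussed in this section. A symbol sequence could then be obtained by assigning each validation state to one of its nodes, although the assignment rule used in SES is not reproduced here.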

Across the datasets, the resulting graphs reveal different levels of regime complexity. AirPassengers produces a compact topology consistent with a strong trend and seasonality; the ETT series yields denser but still regular graphs associated with repeating local states; the Lorenz trajectory produces a more intricate multi-region topology characteristic of a chaotic attractor; and the EEG shows several well-separated clusters consistent with heterogeneous latent regimes.

The 2D views facilitate a cross-dataset comparison of the global connectivity and dominant transitions, whereas the 3D views are useful when 2D projections merge the nearby but distinct latent states. These visualizations are not stopping criteria per se, but they provide an intuitive image of the structural changes monitored by SES.

Overall, the E1 group of experiments shows that relying on a single symbolic descriptor is risky, because different metrics may saturate at different times. Rank aggregation, together with smoothing and liveness filtering, substantially reduces this variability and yields a more reproducible stopping signal.

4.2. Comparison with Baseline Early-Stopping Methods (E2 Group of Experiments)

The E2 group of experiments compares SES with Patience, Slope, CDSC, and SVCCA. The overall picture is consistent across the datasets: the preferred stopping rule depends on the dynamical regime and the model family, and SES should be interpreted as offering a competitive quality–efficiency trade-off rather than a universal optimum on every metric and every dataset.

A detailed description of the results of the E2 group experiments is presented in Appendix C.

Taken together, the results of the E2 group of experiments suggest that SES’s main advantage is a consistent tendency toward earlier stopping with interpretable representation-level diagnostics. On some benchmarks, this leads to a favorable regret-compute trade-off; on others, more conservative baselines remain preferable when minimizing regret is the primary objective.

When interpreting the E2 results, we separate three different objectives. The first one is epoch savings, defined by Equation (19). This metric shows how many training epochs are avoided relative to the common late-training reference, and it is summarized in Figure A1, Figure A4 and Figure A5 in Appendix C. The second one is total wall-clock time. This metric also includes the overhead of extracting hidden states and computing the SES monitoring signal, as reported in Table 1. Therefore, a method that saves more epochs is not always the fastest in real time. This is especially visible for the Transformer setting: SES still saves epochs, but its total wall-clock time is not always the smallest because the base Transformer epoch is already relatively short compared with the representation-monitoring overhead. The third objective is closeness to the oracle, measured by ΔBest in Equation (18) and by the within-ε-oracle frequency in Equation (20). This objective reflects stopping quality rather than computational savings.

Thus, the results should not be reduced to a single ranking of methods. A rule that stops very early can save many epochs but may increase oracle regret. Conversely, a conservative rule may stay closer to the oracle but lose much of the computational benefit. In this sense, SES is best understood as a method that usually provides strong epoch savings, remains competitive in wall-clock efficiency, and offers an interpretable representation-level signal, but does not universally dominate all baselines on regret.

For the descriptive statistics in Sections 4.1–4.6, we report medians, interquartile ranges, and win-rates over N = 10 random seeds for each model–dataset–noise setting. To make this comparison more formal, Appendix F reports a per-cell comparison of SES with each baseline on the clean benchmark. Table A1 and Table A2 show the difference in medians between SES and each baseline for ΔBest and e_stop, together with approximate 95% confidence intervals. Table A3 reports an exact one-sided binomial sign-test over the 24 model × dataset cells.

The sign-test results support the same interpretation as the main experiments. SES stops significantly earlier than Patience, CDSC, and SVCCA, with p-values of 0.0001, 0.0033, and 0.0320, respectively. At the same time, SES does not consistently outperform Patience or CDSC on regret. Therefore, the statistical analysis confirms that SES should be interpreted as a quality–efficiency trade-off rather than as a universally optimal stopping rule. An exact paired Wilcoxon signed-rank analysis would require the per-seed values; these values are released together with the training logs in the project repository.
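The exact one-sided binomial sign test is simple enough to state in code. The sketch assumes that ties are removed before counting and that the null hypothesis assigns probability 1/2 to SES stopping earlier in any given cell; the function name is hypothetical.

```python
from math import comb

def sign_test_one_sided(wins, n):
    """P(X >= wins) for X ~ Binomial(n, 1/2); ties assumed dropped beforehand."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

For n = 24 cells, 21, 19, and 17 wins give p-values that round to 0.0001, 0.0033, and 0.0320. These counts are shown only as an arithmetic illustration consistent with the reported p-values; the actual win counts are those listed in Table A3.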

4.3. Robustness Under Additive Gaussian Noise (E3 Group of Experiments)

The detailed figures and tables for the robustness study are reported in Appendix D. They show how the distributions of stopping epochs, regret, saved epochs, and early-prediction quality evolve as the input signal-to-noise ratio deteriorates.

Across all datasets, additive Gaussian noise leads to the expected increase in regret and in stopping-time dispersion. In most moderate-noise regimes, SES remains competitive and preserves a meaningful compromise between quality and epoch savings. In the hardest settings, however, SES can become either more conservative or more variable, and some loss-based baselines may achieve smaller regret. The reasonable interpretation is therefore not that SES is uniformly robust at every σ, but that it often degrades gradually and remains practically useful on a large subset of regimes.

4.4. Layer-Wise Analysis and Robustness Interpretation (E4 Group of Experiments)

The layer-wise experiments of the E4 group showed that representation-aware stopping can become informative before the final layer, especially in Transformer blocks where intermediate representations may stabilize earlier than the top representation used by the prediction head. This is methodologically important because it suggests that SES does not have to be limited to the final hidden state.

A detailed description of the results of the E4 group experiments is presented in Appendix E.

The results show that for all analyzed NSM types and datasets, intermediate layers provide the most reliable SES signal, reinforcing the conclusion that representation-aware stopping is often the strongest at an intermediate depth rather than at the output-proximal representation.

4.5. Mapper Hyperparameter Transfer Across Architectures (E5 Group of Experiments)

Table 2 presents the cover parameters (bins, overlap), local connectivity (local_k), and merge threshold (merge_eps), together with the resulting win rate against the baseline stopping rules, the median relative regret, and the median fraction of saved epochs. The selected configuration offers the most acceptable compromise between the regret and the epoch savings on the development benchmark; it cannot be read as a universally optimal Mapper setting.

Table 3 shows that the selected Mapper-based SES configuration transfers only partially across architectures. For RNN and BiRNN, SES often stops substantially earlier than the strongest baseline, at the cost of a higher regret; for the Transformer, the transfer is weaker, and SES is better viewed as competitive rather than overtly superior.

Operationally, the Mapper hyperparameters control a bias–variance trade-off of the symbolic partition: bins and local_k determine the structural resolution, whereas overlap and merge_eps determine the topological smoothness and the cross-epoch stability.

4.6. Runtime Profiling and Global Robustness Interpretation (E6 Group of Experiments)

As the results in Table 1 show, the time spent on computing the early-stopping metrics required by SES adds at most 4.1% per-epoch overhead, regardless of the type of NSM, which appears reasonable relative to the overall cost of NSM training.

4.7. Reproducibility and Limitations

For reproducibility, all benchmark comparisons were performed over multiple random seeds under a fixed maximum training horizon, using the same logged runs within each model–dataset comparison. The stopping metrics reported in the tables are defined consistently across methods: e_stop denotes the stopping epoch, val_at_stop refers to the validation loss at that epoch, ΔBest is the gap to the oracle validation minimum over the allowed horizon, and epochs_saved represents the savings relative to the common reference horizon defined in Section 3.4. Auxiliary diagnostics such as R² and IQR are used only for analysis and are not part of the stopping rule itself.

Unless explicitly stated otherwise, table entries referring to val_at_stop or ΔBest are computed at the stopping epoch itself rather than after post-hoc checkpoint restoration. The code base, experiment scripts, and fixed decision thresholds used for the reported runs are available in the respective repository.

The study has several limitations. First, SES depends on the stability of latent-state trajectories and on the choice of Mapper hyperparameters. Second, its operating point is intentionally aggressive, which may increase regret relative to more conservative baselines. Third, the present evaluation covers several model families and datasets, but it does not establish universal superiority across architectures, noise regimes, and sequence tasks.

A further practical limitation is that representation-aware monitoring incurs measurable overhead and can become less stable in the hardest noisy settings, especially when latent trajectories remain weakly organized or when Mapper partitions fluctuate across epochs.

The limitations of SES can also be expressed quantitatively. First, SES is sensitive to Mapper hyperparameters. In the ETTh1 development sweep with RNN and σ = 0, the win-rate across the four tested Mapper configurations changes from 0.067 to 0.200, while the median relative regret changes from 0.467 to 1.016 (Table 2). This shows that the choice of bins, overlap, local_k, and merge_eps is not arbitrary and can noticeably affect the stopping behavior.

Second, SES has an intentionally aggressive operating point. On ETTh1, the median ΔBest of SES is 2.91 × 10⁻³ for RNN and 2.17 × 10⁻³ for BiRNN, whereas the strongest baseline achieves 5.76 × 10⁻⁴ and 1.02 × 10⁻³, respectively (Table 3). At the same time, SES saves 18–19 more epochs than the corresponding baseline. Thus, SES usually trades a larger regret for earlier stopping.

Third, representation-aware monitoring introduces additional wall-clock cost. The per-epoch overhead is at most 4.1% across the three architectures (Table 1), which is relatively small. However, this overhead can still matter when the base model is cheap to train. For example, in the Transformer setting, the total SES wall-clock time is 1195.7 s, whereas Patience and Slope require about 400 s. Therefore, epoch savings and real-time savings should not be treated as the same objective.

Fourth, the current evaluation is still limited in scope. The experiments cover three model families, five dataset groups, eight individual datasets when the ETT family is counted separately, and multiple noise regimes. The clean benchmark contains 24 model × dataset cells. This is sufficient to analyze the proposed mechanism, but it does not yet establish generality for large foundation models, state-space models, or non-temporal sequence tasks.

These results are consistent with prior work on representation dynamics and topological monitoring. The layer-wise analysis in Section 4.4 supports the idea that useful intermediate representations may stabilize before the final validation curve, which extends earlier observations about layer-wise feature formation in deep networks [11,13,14]. The behavior of SVCCA is also expected: SVCCA was designed to measure representational similarity, not to stop training aggressively, and in our experiments, it often stops close to the full training horizon [25]. Finally, the weaker behavior of SES on Lorenz compared with quasi-periodic ETT data is consistent with the sensitivity of symbolic-dynamics descriptors, such as LZ and PermEn, to high-entropy regimes [21,22,36,37].

These observations define the main directions for future work. First, the descriptor panel can be extended with attention-flow descriptors for Transformer models. Second, the rank aggregation can be replaced by a learned descriptor-attention mechanism when enough runs are available to avoid circular dependence on the validation curve. Third, SES can be combined with checkpoint restoration, so that the stopping epoch and the returned model checkpoint are selected separately.

5. Conclusions

In this work, we introduced Symbolic Early Stopping (SES) for neural sequence models (NSM), a hybrid representation-aware stopping criterion that complements loss-based early stopping by explicitly monitoring the structural evolution of hidden-state dynamics during validation. SES combines validation-based Mapper symbolization with a compact panel of symbolic and entropy-geometric descriptors and aggregates them into a single score that acts as a practical proxy for representation stabilization, while retaining the validation loss as a conservative guard.

The empirical study across RNN, BiRNN, and Transformer architectures indicated that SES provides a competitive and practically applicable trade-off between the predictive quality and the computational cost rather than universal dominance over the alternative stopping rules. On structured benchmarks, especially from the ETT family, SES often stops substantially earlier than the full training budget is exhausted, but the related regret ranges from small to clearly visible depending on the architecture and the noise level. On more irregular and regime-switching data, such as Bitcoin and EEG, SES remains usable but exhibits larger variability and stronger dependence on the monitoring hyperparameters.

The robustness study with additive Gaussian noise confirmed that all early-stopping criteria deteriorate as the signal-to-noise ratio decreases. SES often degrades gradually and preserves non-trivial epoch savings in a large subset of the tested settings, but the hardest noisy regimes reveal its limitations: in some cases, the stopping distribution broadens, and in others, more conservative baselines achieve smaller regret.

Another important conclusion is that different competing criteria fail differently. Aggressive methods may occasionally obtain a low regret by stopping extremely early, while activation-similarity methods may collapse to stopping near the full training budget and therefore lose practical value as early-stopping rules. SES occupies a middle ground by providing a structured and interpretable stopping signal tied to latent representation dynamics, rather than a uniformly dominant one.

The hyperparameter-transfer and layer-wise experiments further showed that SES is sensitive to the choice of symbolic partition and to the monitored representation, but that this sensitivity is manageable and can be studied systematically rather than treated as an opaque implementation detail.

Overall, SES appears to be an interpretable representation-aware stopping strategy that is especially useful in scenarios where the validation loss alone is insufficient to reliably identify the end of useful learning. It is not a universal replacement for classical early stopping, but a methodologically grounded complement whose main value lies in the quality-compute-interpretability trade-off.

The three aims stated in Section 1 are addressed as follows. First, SES explicitly monitors latent-state stabilization through the descriptor panel and not only through validation loss. The layer-wise results in Section 4.4 show that this representation-level signal is informative across all three model families and can already appear in intermediate layers.

Second, SES remains computationally practical for epoch-wise monitoring. The additional per-epoch overhead is at most 4.1% (Table 1), and on ETTh1 the method saves a median of 82–85 epochs out of 100 depending on the architecture (Table 3). However, this does not mean that SES is always the fastest method in wall-clock time, especially for cheap base models.

Third, the stopping signal remains interpretable because it is built from standard symbolic-dynamics descriptors: LZ, h_M, PermEn, D_2, and optionally D_F. In Section 4.1, the visible stabilization of these metric trajectories is consistent with the SES stopping point on AirPassengers, ETT, EEG, and Lorenz.

Thus, the main aim of the work is achieved: SES provides an interpretable representation-aware stopping rule with a clear quality–efficiency trade-off. At the same time, the experiments do not support a claim of universal dominance over all baselines, and we do not make such a claim.

Future work will focus on reducing the computational overhead of Mapper construction and symbolization, learning adaptive thresholds and aggregation weights, extending the evaluation to larger and more diverse sequence models, and investigating how representation-aware stopping interacts with multi-step forecasting and multimodal sequence tasks.

Author Contributions

Conceptualization, I.T., R.Z., N.G. and A.V.; methodology, I.T. and R.Z.; software, R.Z.; validation, I.T. and R.Z.; formal analysis, I.T., R.Z. and N.G.; investigation, all authors; writing—original draft preparation, R.Z. and I.T.; writing—review and editing, all authors; visualization, R.Z.; supervision, A.V. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Economic Development of the Russian Federation (IGK 000000C313925P4C0002), agreement No. 139-15-2025-010.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it used publicly available benchmark datasets and did not involve the new collection of human or animal data by the authors.

Informed Consent Statement

Not applicable.

Data Availability Statement

Public benchmark datasets used in this study are available from their original sources as cited in the references. Code and experiment scripts are available in the project repository: https://github.com/RoMoRoToR/NeuroSymbolicDynamics.git (accessed on 27 March 2026).

Acknowledgments

The authors thank the maintainers of the public benchmark datasets and the open-source software used in the experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Algorithmic Specification of SES

This appendix provides an algorithmic specification of the SES procedure used throughout the experiments. Let H_e denote the validation hidden states at epoch e, G_e the Mapper graph built from H_e, C_e the corresponding node centroids, S_e the symbolic strings obtained from validation trajectories, and score_e the aggregated symbolic score. The history of aggregated scores is denoted by S.

SES uses the symbolic score as the primary monitoring signal and validation loss as a conservative guard. Unless explicitly stated otherwise, quantities such as val_at_stop and ΔBest are computed at the stopping epoch itself rather than after post-hoc checkpoint restoration. The optional restoration step in Algorithm A1 is operational and does not redefine the reported stop-epoch metrics.

Symbolic Early Stopping (SES, Hybrid Rule)

Algorithm A1: Symbolic Early Stopping (SES)

Input: model, train_loader, val_loader,
  mapper_params, metrics_params, α,
  E, min_epoch, P, W_slope, ε_slope, δ,
  val_guard_abs, val_guard_rel, guard_win

Output: e_stop, best_epoch, best_val, S

1: Initialize: best_score ← +∞, best_val ← +∞, best_epoch ← 0,
   score_no_improve ← 0, S ← [], val_history ← [], e_stop ← E
2: for e = 1 to E do
3:   Train one epoch
4:   H_e ← collect_hidden(val_loader)
5:   nodes, assign ← build_mapper(H_e, mapper_params)
6:   S_str ← symbolize(assign)
7:   feats ← compute_panel(S_str, metrics_params)
8:   feats_sm ← EMA(feats, α)
9:   score_e ← rank_aggregate(feats_sm)
10:  Append score_e to S; val_e ← evaluate_val_loss(); append val_e to val_history
11:  if score_e < best_score − δ then
12:    best_score ← score_e; score_no_improve ← 0
13:  else score_no_improve ← score_no_improve + 1
14:  if val_e < best_val then best_val ← val_e; best_epoch ← e
15:  stall ← (e ≥ min_epoch) ∧ (score_no_improve ≥ P)
16:  plateau ← (e ≥ min_epoch) ∧ flat_slope(S, W_slope, ε_slope)
17:  guard ← (e ≥ min_epoch) ∧ ¬guard_improve(val_history, …)
18:  if (stall ∨ plateau) ∧ guard then
19:    e_stop ← e; break
20: end for
21: Optionally: restore checkpoint(best_epoch)
22: return e_stop, best_epoch, best_val, S
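The decision logic of Algorithm A1 (steps 10–19) can be replayed offline on logged score and validation-loss curves, as in the sketch below. Training, Mapper construction, and descriptor computation are deliberately left out. The bodies of `flat_slope` and `guard_improve` are not specified in the listing, so the versions here are hypothetical: the first uses a least-squares slope over the trailing window, and the second checks only an absolute improvement of the validation loss within the last `guard_win` epochs. The default values for P, `guard_win`, and `val_guard_abs` are likewise assumptions.

```python
import numpy as np

def flat_slope(scores, window, eps):
    """Hypothetical plateau test: trailing least-squares slope below eps."""
    if len(scores) < window:
        return False
    y = np.asarray(scores[-window:], dtype=float)
    return abs(np.polyfit(np.arange(window), y, 1)[0]) < eps

def guard_improve(val_history, guard_win, val_guard_abs):
    """Hypothetical guard: did val loss improve within the last guard_win epochs?"""
    if len(val_history) <= guard_win:
        return True                       # too early to judge: keep training
    recent = min(val_history[-guard_win:])
    earlier = min(val_history[:-guard_win])
    return earlier - recent > val_guard_abs

def ses_stop(scores, val_losses, min_epoch=5, P=5, W_slope=5,
             eps_slope=1e-3, delta=1e-3, guard_win=3, val_guard_abs=1e-4):
    """Replay the hybrid rule; returns (e_stop, best_epoch, best_val)."""
    best_score, best_val, best_epoch, no_improve = float("inf"), float("inf"), 0, 0
    S, V = [], []
    for e, (score, val) in enumerate(zip(scores, val_losses), start=1):
        S.append(score)
        V.append(val)
        if score < best_score - delta:
            best_score, no_improve = score, 0
        else:
            no_improve += 1
        if val < best_val:
            best_val, best_epoch = val, e
        stall = e >= min_epoch and no_improve >= P
        plateau = e >= min_epoch and flat_slope(S, W_slope, eps_slope)
        guard = e >= min_epoch and not guard_improve(V, guard_win, val_guard_abs)
        if (stall or plateau) and guard:
            return e, best_epoch, best_val
    return len(S), best_epoch, best_val
```

When both curves flatten from epoch 5 onward, this replay stops at epoch 9, the first epoch at which the trailing five scores are flat and the validation loss has not improved recently. Replaying the rule on logged runs, as done for all methods in the benchmark, keeps the comparison between stopping rules independent of training randomness.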

Appendix B. Detailed Description of the Experimental Techniques

In E1, all symbolic descriptors were computed for each model–dataset–seed triplet. We summarized the resulting stopping behavior with medians, IQRs, standard deviations, and, where informative, R² values measuring how well early symbolic descriptors predict final-quality indicators. This experiment was intended to justify the use of a descriptor panel rather than a single metric.

In E2, we trained three model families—RNN, BiRNN, and encoder-only Transformer—under a shared training budget and compared SES with Patience, Slope, CDSC, and SVCCA methods. All stopping rules were evaluated on the same logged runs and compared by the stopping epoch, the validation loss at stop, the oracle regret, the within-ε frequency, and saved epochs.

In E3, additive Gaussian noise with increasing variance was injected into the input series to evaluate robustness to the degradation of the signal-to-noise ratio. For each noise level, model, and dataset, we repeated the same training-and-evaluation protocol and assessed how the distributions of stopping epochs, regret, and saved epochs evolve.
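The noise-injection step can be sketched as follows, using the σ grid stated in Section 3.5. The sketch assumes that z-normalization statistics are computed on the series passed in; in a real pipeline they would normally come from the training split only. The function name and the seeding scheme are illustrative.

```python
import numpy as np

SIGMAS = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]   # noise levels from Section 3.5

def add_noise(series, sigma, seed=0):
    """z-normalise a 1-D series, then add N(0, sigma^2) noise."""
    x = np.asarray(series, dtype=float)
    z = (x - x.mean()) / x.std()
    rng = np.random.default_rng(seed)      # fixed seed for reproducibility
    return z + sigma * rng.normal(size=z.shape)
```

Because the input is z-normalized first, σ is expressed in units of the series standard deviation, so σ = 0.5 corresponds to noise with half the amplitude of the signal, and σ = 0.0 returns the clean normalized series.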

In E4, we tested whether the SES signal is equally informative across network depth by extracting embeddings from different layers or blocks and comparing the resulting stopping behavior. This experiment addressed whether the representation-aware stopping is strongest at the final layer or may already have become informative earlier in the network.

In E5, we examine whether the selected Mapper hyperparameters remain useful across architectures without any per-task retuning. After choosing a development configuration on ETTh1 through a sweep over the main cover and merge settings, we apply the same Mapper configuration to RNN, BiRNN, and Transformer models on the remaining benchmark tasks. For each architecture, we record the stopping epoch, the validation loss at stop, the oracle regret, and the saved epochs. This experiment tests whether a single symbolic partitioning regime can transfer across model families while preserving a practically useful stopping signal.

In E6, we quantify the computational overhead of representation-aware monitoring and assess its wall-clock trade-off relative to earlier stopping. Using the noise-free ETT benchmark, we profile the per-epoch training time, the validation time, the hidden-state extraction time, and the time spent by the stopping rule itself for all three model families. We then compare the additional monitoring time cost with the training time saved through earlier termination. This experiment is intended to determine whether the extra representation-level computations required by SES are justified by the resulting reduction in the total training time.
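The wall-clock trade-off described above can be illustrated with simple arithmetic. The sketch below assumes a constant per-epoch cost and a constant per-epoch monitoring overhead; the function name and all timings are hypothetical and are not measurements from the study.

```python
def net_time_saved(horizon, stop_epoch, epoch_time, monitor_time):
    """Net wall-clock saving of stopping at stop_epoch instead of horizon.

    epoch_time:   assumed cost of one epoch without monitoring (seconds)
    monitor_time: assumed extra monitoring cost per executed epoch (seconds)
    """
    saved = (horizon - stop_epoch) * epoch_time  # training time avoided
    overhead = stop_epoch * monitor_time         # monitoring time actually paid
    return saved - overhead

# Hypothetical numbers: 100-epoch budget, stop at epoch 8.
net = net_time_saved(horizon=100, stop_epoch=8, epoch_time=2.0, monitor_time=0.5)
```

A positive value means the monitoring cost is more than repaid by the epochs that were not run.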

By default, all method thresholds are fixed before the final evaluation and are not re-tuned separately for every target series. This is important in order to avoid overfitting the early-stopping rule itself to a particular benchmark.

Appendix C. Detailed Description of the Results of the E2 Group Experiments

Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, Figure A7, Figure A8, Figure A9 and Figure A10 present the detailed per-dataset results for the E2 group of experiments. For each of the five benchmark datasets, two boxplot panels are provided: the distribution of saved epochs (measuring how much earlier SES and the baselines stop relative to the full training horizon) and the distribution of ΔBest (the gap between the validation loss at the stopping point and the oracle validation minimum over the allowed horizon, measuring stopping quality). Each boxplot summarizes N = 10 random seeds; the three columns within each panel correspond to RNN, BiRNN, and Transformer, respectively. A method in the upper-left region of the epoch-savings panel stops early and conserves compute; a method with a compact, low-ΔBest distribution stays close to the oracle. No single method simultaneously dominates both objectives across all datasets, which is why the figures are presented in pairs.

In the ETT family, SES usually stops much earlier than the full training budget is exhausted. Depending on the architecture, this yields a useful trade-off between the saved epochs and a small-to-moderate regret, but the method cannot be described as uniformly oracle-close on these benchmarks because Patience, CDSC, or Slope can achieve clearly smaller ΔBest in several configurations.

Figure A1. Distribution of saved epochs on the ETT datasets.

Figure A1 shows that SES typically maintains a high median of saved epochs with a moderate dispersion across seeds, indicating that it reduces computational cost in a stable manner. Patience is generally more conservative, while some alternative rules become either more variable or more aggressive as the regime becomes noisier.

Figure A2. Distribution of ΔBest on the ETT datasets.

Figure A2 indicates the expected quality degradation under noise for all methods. For SES, the median regret often grows gradually rather than abruptly; however, the clean-benchmark tables also show that this controlled degradation does not imply a best-in-class ΔBest on every ETT configuration.

The Lorenz trajectory provides the clearest chaotic test case. Here, the trade-off between stopping early and staying close to the oracle becomes especially delicate. SES often identifies a stabilization phase very early, but on the hardest variants, this early stop can be more aggressive than the lowest-regret alternative.

Figure A3. Distribution of ΔBest on Lorenz trajectory (val_at_stop − oracle_val).

For the Lorenz trajectory, the regret distributions remain compact for BiRNN, whereas RNN and especially Transformer show wider variability. Here, Slope is often closer to the oracle in terms of median regret, but this advantage is achieved by stopping later.

Figure A4. Distribution of saved epochs on the Lorenz trajectory.

Figure A4 makes the quality–efficiency trade-off explicit: SES and, in some cases, SVCCA stop extremely early and save many epochs, while Slope is much more conservative. Patience and CDSC typically occupy an intermediate position between these extremes.

Taken together, the Lorenz trajectory results show that on strongly chaotic dynamics, the cost of very early stopping becomes more visible: SES can maximize epoch savings, but later criteria may occasionally achieve smaller regret.

AirPassengers illustrates a different failure mode. The dataset is small and quasi-periodic, and the usefulness of early stopping strongly depends on the architecture. On recurrent models, several criteria, including SES, often do not stop substantially before the full budget is exhausted, whereas on the Transformer, many rules stop very early but not always at an appropriate point.

Figure A5. Distribution of saved epochs on AirPassengers.

On AirPassengers, SES saves almost no epochs for RNN and BiRNN, which is consistent with stopping near the training horizon. By contrast, SVCCA is extremely aggressive on these recurrent models, while the Transformer configuration exhibits genuinely early stopping for SES, Patience, and several baselines.

The corresponding regret plot shows why aggressive stopping is risky on AirPassengers. For RNN and BiRNN, very early strategies, especially SVCCA and, in some runs, CDSC, can incur large regret, whereas methods that stop late naturally remain close to the oracle. For the Transformer, SES achieves early stopping but can become prematurely aggressive compared to more conservative alternatives.

Figure A5 confirms that, on AirPassengers, the practical behavior of ES is architecture-dependent: on recurrent models, conservative rules preserve quality by stopping late, while on the Transformer, the main question is which criterion offers the best compromise between a very early stop and an acceptable regret.

Bitcoin serves as a non-stationary stress test with regime shifts. In this case, ordering of internal representations progresses more unevenly, so the main basis for comparison is not only the final regret but also the consistency with which a method produces a genuinely early stop.

Figure A6. Distribution of ΔBest on AirPassengers (val_at_stop − oracle_val).

Figure A7. Distribution of ΔBest on BTC (val_at_stop − oracle_val).

For BTC, most of the SOTA methods remain in a moderate-regret regime on a substantial fraction of runs, especially for RNN and BiRNN. SES usually avoids systematic catastrophic failures; however, its quality advantage over Patience or CDSC is not consistent, particularly for the Transformer and in the noisier settings.

Figure A8. Distribution of saved epochs on BTC.

Figure A8 appears to be more discriminative than Figure A7: SES often delivers a genuinely early stop on BTC, but the main benefit is compute savings and interpretability rather than uniform regret minimization.

EEG is high-dimensional, noisy, and only weakly stationary, so symbolic metrics stabilize later than on ETT or AirPassengers. This makes EEG one of the hardest datasets for SES and for the baselines alike.

Figure A9. Distribution of ΔBest on EEG (val_at_stop − oracle_val).

On EEG, SES remains usable as an aggressive early-stopping rule, particularly for the recurrent models, but the quality advantage over conservative baselines is not uniform. For the Transformer, Patience and CDSC can equal or improve upon SES in median regret, while SES still retains the benefit of earlier stopping and representation-level diagnostics.

Figure A10. Distribution of saved epochs on EEG.

Despite the difficulty of the EEG dataset, SES still produces a genuinely early stop with high median epoch savings. However, when minimizing regret is the primary objective, the more conservative baselines may remain preferable on several EEG configurations.

Figure A11, Figure A12, Figure A13, Figure A14, Figure A15, Figure A16, Figure A17, Figure A18, Figure A19, Figure A20, Figure A21 and Figure A22 present the detailed per-dataset results for the E3 group of experiments. For each of the six benchmark datasets, two panels are provided: the distribution of saved epochs as a function of noise level σ, and the distribution of ΔBest as a function of σ. The noise levels tested are σ ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}, where σ = 0.0 corresponds to the clean benchmark of Appendix C. Additive Gaussian noise is injected directly into the z-normalized input series before training; all other settings are identical to the E2 protocol. The figures show how the stopping behavior and regret of each rule evolve as the signal-to-noise ratio deteriorates, and which methods degrade most gracefully.
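The perturbation protocol can be sketched as follows: z-normalize the series, then add zero-mean Gaussian noise with standard deviation σ at each tested level. The function names, the use of whole-series statistics for normalization, the seed handling, and the synthetic sine input are assumptions for illustration; the study's own preprocessing code is not reproduced here.

```python
import numpy as np

# Noise levels listed in the text.
NOISE_LEVELS = (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)

def z_normalize(series):
    """Scale a series to zero mean and unit (population) standard deviation."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / series.std()

def add_gaussian_noise(z_series, sigma, seed=0):
    """Add N(0, sigma^2) noise to an already z-normalized series."""
    rng = np.random.default_rng(seed)
    return z_series + rng.normal(0.0, sigma, size=z_series.shape)

# Synthetic stand-in for a benchmark series (not data from the study).
clean = z_normalize(np.sin(np.linspace(0.0, 20.0, 500)))
noisy = {s: add_gaussian_noise(clean, s, seed=0) for s in NOISE_LEVELS}
```

Because the series has unit variance after normalization, σ can be read directly as the noise-to-signal standard-deviation ratio.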

Figure A11. ETTm1: distribution of saved epochs versus noise level.

Figure A12. ETTm1: distribution of ΔBest versus noise level.

Figure A13. ETTh1 (all models): distribution of ΔBest versus noise level.

Figure A14 complements the regret view: SES usually keeps a practically meaningful level of epoch savings even as the noise increases, though variability grows and occasional late stops appear in the hardest cases.

Figure A15. AirPassengers: distribution of saved epochs versus noise level.

Figure A16. AirPassengers: distribution of ΔBest versus noise level.

Figure A17. Lorenz trajectory: distribution of saved epochs versus noise level.

Figure A18. Lorenz trajectory: distribution of ΔBest versus noise level.

Figure A19. BTC-15m: distribution of ΔBest versus noise level.

Figure A20. BTC-15m: distribution of saved epochs versus noise level.

Figure A21. EEG: distribution of saved epochs versus noise level.

Figure A22. EEG: distribution of ΔBest versus noise level.

Appendix D. Detailed Description of the Results of the E3 Group Experiments

The results of the E3 group of experiments are presented below.

For ETTm1, SES preserves substantial epoch savings over a broad noise range. At the highest noise levels, however, it can become markedly more conservative and occasionally stop close to the training horizon.

A notable contrast is SVCCA, which in noisy regimes often collapses to stopping near the full budget and therefore loses practical value as an early-stopping strategy.

The corresponding regret plot shows a gradual increase in ΔBest with increasing noise. For SES, this increase is typically smooth rather than abrupt, suggesting that much of the degradation comes from the data becoming less informative rather than from an unstable stopping signal.

For ETTh1, low-noise regimes leave most methods close to the oracle. As the noise increases, the differences become clearer through widening IQRs and heavier tails. SES generally exhibits controlled degradation, although at the largest σ, some baselines may achieve a smaller regret on particular architectures.

On AirPassengers, many methods continue to stop very early under noise, but high savings do not automatically imply a good stopping decision. In this quasi-periodic regime, overly aggressive stops can produce a disproportionate quality loss.

The regret distributions tend to confirm this concern. For SES and several other aggressive methods, higher noise can induce premature stopping and noticeable regret growth, whereas more conservative rules reduce the risk of serious mistakes by stopping later.

For the Lorenz trajectory, average epoch savings remain high for many methods even under noise, but the dispersion of stopping decisions grows. SES stays early on average, while SVCCA frequently ceases to behave like a practical early-stopping rule because it often fails to trigger.

The main robustness issue on the Lorenz trajectory is not a large shift in the median regret but the appearance of heavy tails and occasional large misses. SES remains near the oracle on many runs, yet high-noise chaotic trajectories can still produce outliers.

On BTC, the median regret often remains moderate even as the noise increases, but rare large outliers become more visible. This is consistent with the regime-switching nature of the series: some runs follow markedly different optimization trajectories under noise.

The main practical difference in BTC lies in the stability of epoch savings. SES usually keeps a strong early-stopping effect, but at high noise, it becomes more variable and may occasionally shift toward late stopping.

EEG is the most challenging robustness setting. SES still provides a meaningful reduction in the training cost, but the spread of saved epochs grows with noise, and some regimes show noticeably less stable stopping decisions.

The EEG regret distributions show that the main effect of noise is a growth of tail risk rather than only a shift of the median. SES remains useful, but in the highest-noise regimes, it becomes more variable and its early stops can come at the cost of a larger regret than the most conservative alternatives.

Appendix E. Detailed Description of the Results of the E4 Group Experiments

For the Transformer on BTC_15m, intermediate blocks (approximately ℓ = 1–4) achieve the best quality–efficiency balance. Figure A23 shows that these blocks produce lower and more stable ΔBest than either the earliest monitored representation (ℓ = 0) or the last block (ℓ = 5).

Figure A24 shows the efficiency side of the same trade-off: the intermediate blocks typically yield the largest epoch savings, implying that SES can stop substantially earlier while keeping regret low.

Figure A23. ΔBest vs. layer (BTC_15m, Transformer).

Figure A24. epochs_saved vs. layer (BTC_15m, Transformer).

Figure A25 explains the origin of this advantage. Intermediate blocks tend to trigger earlier and with smaller dispersion across seeds, whereas the final block exhibits a much wider stopping-time distribution. This is consistent with the idea that output-proximal representations remain more sensitive to late-stage fitting and stochastic fluctuations.

Figure A25. Stop epoch vs. layer (BTC_15m, Transformer).

The same analysis for the bidirectional recurrent baseline leads to a similar conclusion. Figure A26, Figure A27 and Figure A28 show that intermediate recurrent layers yield lower and more stable regret, save more epochs, and stop with smaller cross-seed dispersion than the deepest monitored layer.

Figure A26. ΔBest vs. layer (BTC_15m, RNN_BiLSTM).

Figure A27. epochs_saved vs. layer (BTC_15m, RNN_BiLSTM).

Figure A28. Stop epoch vs. layer (BTC_15m, RNN_BiLSTM).

The unidirectional recurrent baseline reproduces the same qualitative pattern. Figure A29, Figure A30 and Figure A31 show that intermediate layers again provide the most reliable SES signal, reinforcing the conclusion that representation-aware stopping is often strongest at an intermediate depth rather than at the output-proximal representation.

Figure A29. ΔBest vs. layer (BTC_15m, RNN_LSTM).

Figure A30. epochs_saved vs. layer (BTC_15m, RNN_LSTM).

Figure A31. Stop epoch vs. layer (BTC_15m, RNN_LSTM).

Appendix F. Statistical Comparison of SES with Baseline Stopping Rules

Table A1, Table A2 and Table A3 complement the descriptive statistics of Section 4.2 by reporting a per-cell comparison of SES against each of the four baseline stopping rules (Patience, Slope, CDSC, SVCCA) on the clean benchmark (N = 10 seeds per cell). Table A1 reports the per-cell comparison on regret (ΔBest, smaller is better). Table A2 reports the per-cell comparison on stopping epoch (e_stop, smaller is better). Table A3 aggregates the sign of the median differences across 24 (model × dataset) cells and reports an exact one-sided binomial sign-test p-value for each (baseline, metric) pair.

Methodological note on uncertainty quantification. The 95% confidence intervals reported in Table A1 and Table A2 are constructed under a Normal-approximation assumption: for a sample of size N from a near-Normal distribution, the standard error of the median is approximately 1.2533 × σ/√N, with σ ≈ IQR/1.349. For median-difference uncertainty, we use the Wald-type pooled standard error √(SE_SES² + SE_base²); this is conservative. The sign-test p-values in Table A3 are exact one-sided binomial tail probabilities P(K ≥ wins | n, p = 0.5) and do not depend on any distributional approximation. An exact paired Wilcoxon signed-rank analysis requires the raw per-seed values available in the project repository.
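The interval construction described in this note can be sketched in a few lines. The function names are hypothetical, and the critical value z = 1.96 is an assumption for a two-sided 95% Normal interval; the constants 1.2533 and 1.349 and the pooled standard error follow the note. The example inputs are the medians and IQRs of the ETTh1/RNN/Patience row of Table A1.

```python
import math

def se_median(iqr, n):
    """Normal-approximation standard error of a median from the IQR."""
    sigma = iqr / 1.349                   # sigma ~ IQR / 1.349
    return 1.2533 * sigma / math.sqrt(n)  # SE(median) ~ 1.2533 * sigma / sqrt(N)

def median_diff_ci(med_ses, iqr_ses, med_base, iqr_base, n=10, z=1.96):
    """Wald-type interval for the median difference (SES minus baseline)."""
    diff = med_ses - med_base
    # Pooled SE: sqrt(SE_SES^2 + SE_base^2)
    se = math.hypot(se_median(iqr_ses, n), se_median(iqr_base, n))
    return diff, diff - z * se, diff + z * se

# ETTh1 / RNN / Patience row of Table A1 (N = 10 seeds).
diff, lo, hi = median_diff_ci(2.7623, 3.1851, 0.1127, 9.1169)
```

With these inputs the sketch returns an interval of roughly [−2.91, 8.21] around a difference of 2.6496, which agrees with the tabulated row up to rounding of the reported medians and IQRs.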

Table A1. Per-cell comparison on ΔBest (regret, smaller is better). “SES wins?” marks the direction of the median difference; ✓ indicates SES has the smaller median.

Dataset	Model	Baseline	SES ΔBest, med [IQR]	Baseline ΔBest, med [IQR]	Δmed (SES−Base) [95% CI]	SES wins?
ETTh1	RNN	Patience	2.7623 [3.1851]	0.1127 [9.1169]	+2.6496 [−2.9115, 8.2106]	✗
ETTh1	RNN	Slope	2.7623 [3.1851]	2.7804 [6.4139]	−0.0181 [−4.1418, 4.1056]	✓
ETTh1	RNN	CDSC	2.7623 [3.1851]	2.7768 [2.6655]	−0.0145 [−2.4062, 2.3771]	✓
ETTh1	RNN	SVCCA	2.7623 [3.1851]	0 [0]	+2.7623 [0.9281, 4.5964]	✗
ETTh1	BiRNN	Patience	0.1134 [0.1199]	0.1110 [0.1110]	+0.0025 [−0.0916, 0.0966]	✗
ETTh1	BiRNN	Slope	0.1134 [0.1199]	0.1370 [0.2410]	−0.0236 [−0.1786, 0.1315]	✓
ETTh1	BiRNN	CDSC	0.1134 [0.1199]	0.1123 [0.0971]	+0.0011 [−0.0877, 0.0900]	✗
ETTh1	BiRNN	SVCCA	0.1134 [0.1199]	0 [0]	+0.1134 [0.0444, 0.1825]	✗
ETTh1	Transformer	Patience	0.2740 [0.0793]	0.0614 [0.3956]	+0.2126 [−0.0197, 0.4449]	✗
ETTh1	Transformer	Slope	0.2740 [0.0793]	0.3424 [0.1036]	−0.0683 [−0.1434, 0.0068]	✓
ETTh1	Transformer	CDSC	0.2740 [0.0793]	0.2318 [0.0719]	+0.0422 [−0.0194, 0.1038]	✗
ETTh1	Transformer	SVCCA	0.2740 [0.0793]	0 [0]	+0.2740 [0.2284, 0.3197]	✗
ETTh2	RNN	Patience	0.2104 [0.1514]	0.1887 [0.1568]	+0.0217 [−0.1038, 0.1472]	✗
ETTh2	RNN	Slope	0.2104 [0.1514]	0.2351 [0.2935]	−0.0248 [−0.2150, 0.1654]	✓
ETTh2	RNN	CDSC	0.2104 [0.1514]	0.1990 [0.1450]	+0.0114 [−0.1093, 0.1321]	✗
ETTh2	RNN	SVCCA	0.2104 [0.1514]	0 [0]	+0.2104 [0.1232, 0.2975]	✗
ETTh2	BiRNN	Patience	0.2078 [0.1601]	0.1995 [0.1736]	+0.0084 [−0.1276, 0.1444]	✗
ETTh2	BiRNN	Slope	0.2078 [0.1601]	0.2429 [0.3041]	−0.0351 [−0.2329, 0.1628]	✓
ETTh2	BiRNN	CDSC	0.2078 [0.1601]	0.2063 [0.1538]	+0.0016 [−0.1263, 0.1294]	✗
ETTh2	BiRNN	SVCCA	0.2078 [0.1601]	0 [0]	+0.2078 [0.1157, 0.3000]	✗
ETTh2	Transformer	Patience	0.2575 [0.0813]	0.0737 [0.3992]	+0.1838 [−0.0508, 0.4184]	✗
ETTh2	Transformer	Slope	0.2575 [0.0813]	0.3107 [0.1131]	−0.0532 [−0.1334, 0.0270]	✓
ETTh2	Transformer	CDSC	0.2575 [0.0813]	0.2147 [0.0749]	+0.0428 [−0.0209, 0.1064]	✗
ETTh2	Transformer	SVCCA	0.2575 [0.0813]	0 [0]	+0.2575 [0.2107, 0.3043]	✗
ETTm1	RNN	Patience	0.5307 [0.1422]	0.5122 [0.1548]	+0.0184 [−0.1026, 0.1395]	✗
ETTm1	RNN	Slope	0.5307 [0.1422]	0.5672 [0.3073]	−0.0365 [−0.2315, 0.1584]	✓
ETTm1	RNN	CDSC	0.5307 [0.1422]	0.5127 [0.1391]	+0.0179 [−0.0966, 0.1324]	✗
ETTm1	RNN	SVCCA	0.5307 [0.1422]	0 [0]	+0.5307 [0.4488, 0.6125]	✗
ETTm1	BiRNN	Patience	0.5338 [0.1465]	0.5183 [0.1764]	+0.0155 [−0.1166, 0.1476]	✗
ETTm1	BiRNN	Slope	0.5338 [0.1465]	0.5960 [0.3226]	−0.0622 [−0.2662, 0.1419]	✓
ETTm1	BiRNN	CDSC	0.5338 [0.1465]	0.5278 [0.1477]	+0.0060 [−0.1138, 0.1258]	✗
ETTm1	BiRNN	SVCCA	0.5338 [0.1465]	0 [0]	+0.5338 [0.4495, 0.6182]	✗
ETTm1	Transformer	Patience	0.4262 [0.0827]	0.2519 [0.4016]	+0.1743 [−0.0617, 0.4104]	✗
ETTm1	Transformer	Slope	0.4262 [0.0827]	0.4914 [0.1127]	−0.0652 [−0.1457, 0.0153]	✓
ETTm1	Transformer	CDSC	0.4262 [0.0827]	0.3888 [0.0771]	+0.0374 [−0.0277, 0.1025]	✗
ETTm1	Transformer	SVCCA	0.4262 [0.0827]	0 [0]	+0.4262 [0.3786, 0.4738]	✗
ETTm2	RNN	Patience	0.5793 [0.1475]	0.5562 [0.1653]	+0.0231 [−0.1045, 0.1507]	✗
ETTm2	RNN	Slope	0.5793 [0.1475]	0.6247 [0.3276]	−0.0454 [−0.2523, 0.1615]	✓
ETTm2	RNN	CDSC	0.5793 [0.1475]	0.5674 [0.1504]	+0.0119 [−0.1094, 0.1332]	✗
ETTm2	RNN	SVCCA	0.5793 [0.1475]	0 [0]	+0.5793 [0.4943, 0.6642]	✗
ETTm2	BiRNN	Patience	0.5811 [0.1548]	0.5600 [0.1770]	+0.0211 [−0.1143, 0.1565]	✗
ETTm2	BiRNN	Slope	0.5811 [0.1548]	0.6541 [0.3434]	−0.0730 [−0.2899, 0.1439]	✓
ETTm2	BiRNN	CDSC	0.5811 [0.1548]	0.5742 [0.1568]	+0.0069 [−0.1200, 0.1338]	✗
ETTm2	BiRNN	SVCCA	0.5811 [0.1548]	0 [0]	+0.5811 [0.4920, 0.6703]	✗
ETTm2	Transformer	Patience	0.4689 [0.0882]	0.2933 [0.4144]	+0.1756 [−0.0683, 0.4196]	✗
ETTm2	Transformer	Slope	0.4689 [0.0882]	0.5357 [0.1184]	−0.0668 [−0.1518, 0.0182]	✓
ETTm2	Transformer	CDSC	0.4689 [0.0882]	0.4281 [0.0805]	+0.0409 [−0.0279, 0.1096]	✗
ETTm2	Transformer	SVCCA	0.4689 [0.0882]	0 [0]	+0.4689 [0.4181, 0.5197]	✗
AirPassengers	RNN	Patience	0 [0]	0 [0]	0 [0, 0]	≈
AirPassengers	RNN	Slope	0 [0]	0 [0]	0 [0, 0]	≈
AirPassengers	RNN	CDSC	0 [0]	58,891.3260 [36,358.8600]	−58,891.3260 [−79,828.0833, −37,954.5687]	✓
AirPassengers	RNN	SVCCA	0 [0]	97,363.3320 [330.6500]	−97,363.3320 [−97,553.7323, −97,172.9317]	✓
AirPassengers	BiRNN	Patience	0 [0]	0 [0]	0 [0, 0]	≈
AirPassengers	BiRNN	Slope	0 [0]	0 [0]	0 [0, 0]	≈
AirPassengers	BiRNN	CDSC	0 [0]	992.3630 [4252.1340]	−992.3630 [−3440.8968, 1456.1708]	✓
AirPassengers	BiRNN	SVCCA	0 [0]	1.539 × 10⁵ [219.3690]	−1.539 × 10⁵ [−1.540 × 10⁵, −1.538 × 10⁵]	✓
AirPassengers	Transformer	Patience	8817.4850 [3716.4060]	6811.7220 [4607.0510]	+2005.7630 [−1402.7118, 5414.2378]	✗
AirPassengers	Transformer	Slope	8817.4850 [3716.4060]	437.5180 [1315.3390]	+8379.9670 [6109.8425, 10,650.0915]	✗
AirPassengers	Transformer	CDSC	8817.4850 [3716.4060]	2036.1630 [2037.0710]	+6781.3220 [4340.8803, 9221.7637]	✗
AirPassengers	Transformer	SVCCA	8817.4850 [3716.4060]	7074.6250 [4660.1190]	+1742.8600 [−1689.4530, 5175.1730]	✗
Lorenz	RNN	Patience	0.0190 [0.0437]	0.0088 [0.0129]	+0.0102 [−0.0160, 0.0365]	✗
Lorenz	RNN	Slope	0.0190 [0.0437]	0.0041 [0.0036]	+0.0149 [−0.0103, 0.0402]	✗
Lorenz	RNN	CDSC	0.0190 [0.0437]	0.0131 [0.0379]	+0.0059 [−0.0274, 0.0392]	✗
Lorenz	RNN	SVCCA	0.0190 [0.0437]	0.0086 [0.0209]	+0.0105 [−0.0174, 0.0384]	✗
Lorenz	BiRNN	Patience	0.0074 [0.0068]	0.0068 [0.0043]	+6.590 × 10⁻⁴ [−0.0040, 0.0053]	✗
Lorenz	BiRNN	Slope	0.0074 [0.0068]	0.0048 [0.0079]	+0.0027 [−0.0033, 0.0087]	✗
Lorenz	BiRNN	CDSC	0.0074 [0.0068]	0.0046 [0.0152]	+0.0029 [−0.0067, 0.0124]	✗
Lorenz	BiRNN	SVCCA	0.0074 [0.0068]	0.0023 [0.0020]	+0.0051 [0.0011, 0.0092]	✗
Lorenz	Transformer	Patience	0.0335 [0.0544]	0.0224 [0.0206]	+0.0111 [−0.0224, 0.0446]	✗
Lorenz	Transformer	Slope	0.0335 [0.0544]	0.0036 [0.0062]	+0.0300 [−0.0015, 0.0614]	✗
Lorenz	Transformer	CDSC	0.0335 [0.0544]	0.0164 [0.0326]	+0.0171 [−0.0194, 0.0536]	✗
Lorenz	Transformer	SVCCA	0.0335 [0.0544]	0.0153 [0.0241]	+0.0182 [−0.0160, 0.0525]	✗
BTC15m	RNN	Patience	1.309 × 10⁻⁶ [6.122 × 10⁻⁷]	6.883 × 10⁻⁷ [9.384 × 10⁻⁷]	+6.207 × 10⁻⁷ [−2.449 × 10⁻⁸, 1.266 × 10⁻⁶]	✗
BTC15m	RNN	Slope	1.309 × 10⁻⁶ [6.122 × 10⁻⁷]	1.126 × 10⁻⁶ [7.583 × 10⁻⁷]	+1.830 × 10⁻⁷ [−3.782 × 10⁻⁷, 7.442 × 10⁻⁷]	✗
BTC15m	RNN	CDSC	1.309 × 10⁻⁶ [6.122 × 10⁻⁷]	1.319 × 10⁻⁶ [7.794 × 10⁻⁷]	−1.000 × 10⁻⁸ [−5.807 × 10⁻⁷, 5.607 × 10⁻⁷]	✓
BTC15m	RNN	SVCCA	1.309 × 10⁻⁶ [6.122 × 10⁻⁷]	9.804 × 10⁻⁸ [1.032 × 10⁻⁷]	+1.211 × 10⁻⁶ [8.535 × 10⁻⁷, 1.568 × 10⁻⁶]	✗
BTC15m	BiRNN	Patience	9.950 × 10⁻⁷ [9.029 × 10⁻⁷]	6.209 × 10⁻⁷ [9.143 × 10⁻⁷]	+3.741 × 10⁻⁷ [−3.658 × 10⁻⁷, 1.114 × 10⁻⁶]	✗
BTC15m	BiRNN	Slope	9.950 × 10⁻⁷ [9.029 × 10⁻⁷]	1.840 × 10⁻⁶ [1.800 × 10⁻⁶]	−8.450 × 10⁻⁷ [−2.005 × 10⁻⁶, 3.146 × 10⁻⁷]	✓
BTC15m	BiRNN	CDSC	9.950 × 10⁻⁷ [9.029 × 10⁻⁷]	1.364 × 10⁻⁶ [1.197 × 10⁻⁶]	−3.690 × 10⁻⁷ [−1.232 × 10⁻⁶, 4.944 × 10⁻⁷]	✓
BTC15m	BiRNN	SVCCA	9.950 × 10⁻⁷ [9.029 × 10⁻⁷]	3.207 × 10⁻⁷ [5.915 × 10⁻⁷]	+6.743 × 10⁻⁷ [5.274 × 10⁻⁸, 1.296 × 10⁻⁶]	✗
BTC15m	Transformer	Patience	6.645 × 10⁻⁶ [9.574 × 10⁻⁶]	4.485 × 10⁻⁶ [7.079 × 10⁻⁶]	+2.160 × 10⁻⁶ [−4.696 × 10⁻⁶, 9.016 × 10⁻⁶]	✗
BTC15m	Transformer	Slope	6.645 × 10⁻⁶ [9.574 × 10⁻⁶]	1.139 × 10⁻⁵ [1.774 × 10⁻⁵]	−4.745 × 10⁻⁶ [−1.635 × 10⁻⁵, 6.863 × 10⁻⁶]	✓
BTC15m	Transformer	CDSC	6.645 × 10⁻⁶ [9.574 × 10⁻⁶]	1.139 × 10⁻⁵ [1.845 × 10⁻⁵]	−4.745 × 10⁻⁶ [−1.671 × 10⁻⁵, 7.224 × 10⁻⁶]	✓
BTC15m	Transformer	SVCCA	6.645 × 10⁻⁶ [9.574 × 10⁻⁶]	7.171 × 10⁻⁶ [3.941 × 10⁻⁶]	−5.260 × 10⁻⁷ [−6.488 × 10⁻⁶, 5.436 × 10⁻⁶]	✓
EEG	RNN	Patience	0.0116 [0.0150]	0.0122 [0.0180]	−5.960 × 10⁻⁴ [−0.0141, 0.0129]	✓
EEG	RNN	Slope	0.0116 [0.0150]	0.0115 [0.0127]	+1.430 × 10⁻⁴ [−0.0112, 0.0115]	✗
EEG	RNN	CDSC	0.0116 [0.0150]	0.0077 [0.0131]	+0.0039 [−0.0076, 0.0154]	✗
EEG	RNN	SVCCA	0.0116 [0.0150]	0.0422 [0.0312]	−0.0306 [−0.0506, −0.0107]	✓
EEG	BiRNN	Patience	0.0140 [0.0256]	0.0145 [0.0416]	−4.550 × 10⁻⁴ [−0.0286, 0.0277]	✓
EEG	BiRNN	Slope	0.0140 [0.0256]	0.0132 [0.0191]	+8.510 × 10⁻⁴ [−0.0175, 0.0192]	✗
EEG	BiRNN	CDSC	0.0140 [0.0256]	0.0127 [0.0114]	+0.0014 [−0.0148, 0.0175]	✗
EEG	BiRNN	SVCCA	0.0140 [0.0256]	0.1173 [0.0589]	−0.1033 [−0.1403, −0.0663]	✓
EEG	Transformer	Patience	0.0216 [0.0196]	0.0162 [0.0275]	+0.0053 [−0.0141, 0.0248]	✗
EEG	Transformer	Slope	0.0216 [0.0196]	0.0273 [0.0247]	−0.0057 [−0.0238, 0.0125]	✓
EEG	Transformer	CDSC	0.0216 [0.0196]	0.0210 [0.0167]	+6.060 × 10⁻⁴ [−0.0142, 0.0154]	✗
EEG	Transformer	SVCCA	0.0216 [0.0196]	0.0206 [0.0175]	+9.930 × 10⁻⁴ [−0.0141, 0.0161]	✗

Table A2. Per-cell comparison on e_stop (stopping epoch, smaller is better). “Earlier?” marks the direction of the median difference.

Dataset	Model	Baseline	SES e_stop, med [IQR]	Baseline e_stop, med [IQR]	Δmed (SES−Base) [95% CI]	Earlier?
ETTh1	RNN	Patience	7.5 [1.0]	23.5 [9.0]	−16.0 [−21.2, −10.8]	✓
ETTh1	RNN	Slope	7.5 [1.0]	5.0 [0]	+2.5 [1.9, 3.1]	✗
ETTh1	RNN	CDSC	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTh1	RNN	SVCCA	7.5 [1.0]	100.0 [0]	−92.5 [−93.1, −91.9]	✓
ETTh1	BiRNN	Patience	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTh1	BiRNN	Slope	7.5 [1.0]	5.0 [0]	+2.5 [1.9, 3.1]	✗
ETTh1	BiRNN	CDSC	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTh1	BiRNN	SVCCA	7.5 [1.0]	100.0 [0]	−92.5 [−93.1, −91.9]	✓
ETTh1	Transformer	Patience	7.0 [0]	19.0 [23.0]	−12.0 [−25.2, 1.2]	✓
ETTh1	Transformer	Slope	7.0 [0]	5.0 [1.0]	+2.0 [1.4, 2.6]	✗
ETTh1	Transformer	CDSC	7.0 [0]	10.0 [0]	−3.0 [−3.0, −3.0]	✓
ETTh1	Transformer	SVCCA	7.0 [0]	100.0 [0]	−93.0 [−93.0, −93.0]	✓
ETTh2	RNN	Patience	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTh2	RNN	Slope	7.5 [1.0]	5.0 [0]	+2.5 [1.9, 3.1]	✗
ETTh2	RNN	CDSC	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTh2	RNN	SVCCA	7.5 [1.0]	100.0 [0]	−92.5 [−93.1, −91.9]	✓
ETTh2	BiRNN	Patience	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTh2	BiRNN	Slope	7.5 [1.0]	5.0 [0]	+2.5 [1.9, 3.1]	✗
ETTh2	BiRNN	CDSC	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTh2	BiRNN	SVCCA	7.5 [1.0]	100.0 [0]	−92.5 [−93.1, −91.9]	✓
ETTh2	Transformer	Patience	7.0 [0]	14.5 [26.5]	−7.5 [−22.8, 7.8]	✓
ETTh2	Transformer	Slope	7.0 [0]	5.0 [1.0]	+2.0 [1.4, 2.6]	✗
ETTh2	Transformer	CDSC	7.0 [0]	10.0 [0]	−3.0 [−3.0, −3.0]	✓
ETTh2	Transformer	SVCCA	7.0 [0]	100.0 [0]	−93.0 [−93.0, −93.0]	✓
ETTm1	RNN	Patience	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTm1	RNN	Slope	7.5 [1.0]	5.0 [0]	+2.5 [1.9, 3.1]	✗
ETTm1	RNN	CDSC	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTm1	RNN	SVCCA	7.5 [1.0]	100.0 [0]	−92.5 [−93.1, −91.9]	✓
ETTm1	BiRNN	Patience	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTm1	BiRNN	Slope	7.5 [1.0]	5.0 [0]	+2.5 [1.9, 3.1]	✗
ETTm1	BiRNN	CDSC	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTm1	BiRNN	SVCCA	7.5 [1.0]	100.0 [0]	−92.5 [−93.1, −91.9]	✓
ETTm1	Transformer	Patience	7.0 [0]	14.5 [26.5]	−7.5 [−22.8, 7.8]	✓
ETTm1	Transformer	Slope	7.0 [0]	5.0 [1.0]	+2.0 [1.4, 2.6]	✗
ETTm1	Transformer	CDSC	7.0 [0]	10.0 [0]	−3.0 [−3.0, −3.0]	✓
ETTm1	Transformer	SVCCA	7.0 [0]	100.0 [0]	−93.0 [−93.0, −93.0]	✓
ETTm2	RNN	Patience	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTm2	RNN	Slope	7.5 [1.0]	5.0 [0]	+2.5 [1.9, 3.1]	✗
ETTm2	RNN	CDSC	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTm2	RNN	SVCCA	7.5 [1.0]	100.0 [0]	−92.5 [−93.1, −91.9]	✓
ETTm2	BiRNN	Patience	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTm2	BiRNN	Slope	7.5 [1.0]	5.0 [0]	+2.5 [1.9, 3.1]	✗
ETTm2	BiRNN	CDSC	7.5 [1.0]	10.0 [0]	−2.5 [−3.1, −1.9]	✓
ETTm2	BiRNN	SVCCA	7.5 [1.0]	100.0 [0]	−92.5 [−93.1, −91.9]	✓
ETTm2	Transformer	Patience	7.0 [0]	14.5 [26.5]	−7.5 [−22.8, 7.8]	✓
ETTm2	Transformer	Slope	7.0 [0]	5.0 [1.0]	+2.0 [1.4, 2.6]	✗
ETTm2	Transformer	CDSC	7.0 [0]	10.0 [0]	−3.0 [−3.0, −3.0]	✓
ETTm2	Transformer	SVCCA	7.0 [0]	100.0 [0]	−93.0 [−93.0, −93.0]	✓
AirPassengers	RNN	Patience	100.0 [0]	100.0 [0]	0 [0, 0]	≈
AirPassengers	RNN	Slope	100.0 [0]	100.0 [0]	0 [0, 0]	≈
AirPassengers	RNN	CDSC	100.0 [0]	38.0 [32.2]	+62.0 [43.5, 80.5]	✗
AirPassengers	RNN	SVCCA	100.0 [0]	6.0 [0]	+94.0 [94.0, 94.0]	✗
AirPassengers	BiRNN	Patience	100.0 [0]	100.0 [0]	0 [0, 0]	≈
AirPassengers	BiRNN	Slope	100.0 [0]	100.0 [0]	0 [0, 0]	≈
AirPassengers	BiRNN	CDSC	100.0 [0]	97.0 [10.2]	+3.0 [−2.9, 8.9]	✗
AirPassengers	BiRNN	SVCCA	100.0 [0]	6.0 [0]	+94.0 [94.0, 94.0]	✗
AirPassengers	Transformer	Patience	7.0 [0]	8.0 [15.0]	−1.0 [−9.6, 7.6]	✓
AirPassengers	Transformer	Slope	7.0 [0]	100.0 [0]	−93.0 [−93.0, −93.0]	✓
AirPassengers	Transformer	CDSC	7.0 [0]	20.5 [4.8]	−13.5 [−16.3, −10.7]	✓
AirPassengers	Transformer	SVCCA	7.0 [0]	6.0 [0]	+1.0 [1.0, 1.0]	✗
Lorenz	RNN	Patience	7.0 [0]	11.5 [2.8]	−4.5 [−6.1, −2.9]	✓
Lorenz	RNN	Slope	7.0 [0]	39.5 [22.2]	−32.5 [−45.3, −19.7]	✓
Lorenz	RNN	CDSC	7.0 [0]	11.5 [4.8]	−4.5 [−7.3, −1.7]	✓
Lorenz	RNN	SVCCA	7.0 [0]	6.0 [0]	+1.0 [1.0, 1.0]	✗
Lorenz	BiRNN	Patience	7.0 [0]	11.5 [4.0]	−4.5 [−6.8, −2.2]	✓
Lorenz	BiRNN	Slope	7.0 [0]	20.5 [8.5]	−13.5 [−18.4, −8.6]	✓
Lorenz	BiRNN	CDSC	7.0 [0]	13.0 [2.8]	−6.0 [−7.6, −4.4]	✓
Lorenz	BiRNN	SVCCA	7.0 [0]	6.0 [0]	+1.0 [1.0, 1.0]	✗
Lorenz	Transformer	Patience	7.0 [0]	11.5 [4.0]	−4.5 [−6.8, −2.2]	✓
Lorenz	Transformer	Slope	7.0 [0]	31.0 [14.5]	−24.0 [−32.3, −15.7]	✓
Lorenz	Transformer	CDSC	7.0 [0]	12.0 [2.0]	−5.0 [−6.2, −3.8]	✓
Lorenz	Transformer	SVCCA	7.0 [0]	6.0 [0]	+1.0 [1.0, 1.0]	✗
BTC15m	RNN	Patience	6.0 [1.0]	6.0 [0]	0 [−0.6, 0.6]	≈
BTC15m	RNN	Slope	6.0 [1.0]	5.0 [0]	+1.0 [0.4, 1.6]	✗
BTC15m	RNN	CDSC	6.0 [1.0]	5.0 [1.0]	+1.0 [0.2, 1.8]	✗
BTC15m	RNN	SVCCA	6.0 [1.0]	100.0 [0]	−94.0 [−94.6, −93.4]	✓
BTC15m	BiRNN	Patience	7.0 [1.0]	6.0 [0]	+1.0 [0.4, 1.6]	✗
BTC15m	BiRNN	Slope	7.0 [1.0]	5.0 [0]	+2.0 [1.4, 2.6]	✗
BTC15m	BiRNN	CDSC	7.0 [1.0]	5.0 [0]	+2.0 [1.4, 2.6]	✗
BTC15m	BiRNN	SVCCA	7.0 [1.0]	100.0 [0]	−93.0 [−93.6, −92.4]	✓
BTC15m	Transformer	Patience	6.0 [1.0]	6.0 [0]	0 [−0.6, 0.6]	≈
BTC15m	Transformer	Slope	6.0 [1.0]	5.0 [0]	+1.0 [0.4, 1.6]	✗
BTC15m	Transformer	CDSC	6.0 [1.0]	5.0 [0.8]	+1.0 [0.3, 1.7]	✗
BTC15m	Transformer	SVCCA	6.0 [1.0]	100.0 [0]	−94.0 [−94.6, −93.4]	✓
EEG	RNN	Patience	7.5 [2.8]	7.5 [4.8]	0 [−3.2, 3.2]	≈
EEG	RNN	Slope	7.5 [2.8]	5.0 [1.8]	+2.5 [0.6, 4.4]	✗
EEG	RNN	CDSC	7.5 [2.8]	10.0 [0]	−2.5 [−4.1, −0.9]	✓
EEG	RNN	SVCCA	7.5 [2.8]	100.0 [0]	−92.5 [−94.1, −90.9]	✓
EEG	BiRNN	Patience	6.5 [2.5]	7.5 [4.0]	−1.0 [−3.7, 1.7]	✓
EEG	BiRNN	Slope	6.5 [2.5]	5.0 [0.8]	+1.5 [−0.0, 3.0]	✗
EEG	BiRNN	CDSC	6.5 [2.5]	10.0 [0]	−3.5 [−4.9, −2.1]	✓
EEG	BiRNN	SVCCA	6.5 [2.5]	100.0 [0]	−93.5 [−94.9, −92.1]	✓
EEG	Transformer	Patience	7.5 [4.8]	7.5 [6.8]	0 [−4.8, 4.8]	≈
EEG	Transformer	Slope	7.5 [4.8]	5.5 [1.0]	+2.0 [−0.8, 4.8]	✗
EEG	Transformer	CDSC	7.5 [4.8]	10.0 [0.8]	−2.5 [−5.3, 0.3]	✓
EEG	Transformer	SVCCA	7.5 [4.8]	5.0 [71.2]	+2.5 [−38.6, 43.6]	✗

Table A3. Aggregate sign-test summary across 24 (model × dataset) cells. H₀: P(SES wins) = 0.5; one-sided binomial test; ties excluded from effective n.

Comparison	Metric	SES wins	Ties	Losses	Total	p-Value (One-Sided Sign Test)
SES vs. Patience	ΔBest (↓)	2	2	20	24	1.0000
SES vs. Patience	e_stop (↓)	17	6	1	24	<0.0001
SES vs. Patience	epochs_saved (↑)	17	6	1	24	<0.0001
SES vs. Slope	ΔBest (↓)	15	2	7	24	0.0669
SES vs. Slope	e_stop (↓)	4	2	18	24	0.9996
SES vs. Slope	epochs_saved (↑)	4	2	18	24	0.9996
SES vs. CDSC	ΔBest (↓)	6	0	18	24	0.9967
SES vs. CDSC	e_stop (↓)	19	0	5	24	0.0033
SES vs. CDSC	epochs_saved (↑)	19	0	5	24	0.0033
SES vs. SVCCA	ΔBest (↓)	5	0	19	24	0.9992
SES vs. SVCCA	e_stop (↓)	17	0	7	24	0.0320
SES vs. SVCCA	epochs_saved (↑)	17	0	7	24	0.0320

Reading of Table A1, Table A2 and Table A3. SES stops significantly earlier than Patience, CDSC, and SVCCA (p < 0.0001, p = 0.0033, and p = 0.0320, respectively, on e_stop) but does not match Patience or CDSC on regret, confirming the quality–efficiency trade-off framing of Section 4.2. Against Slope, SES is later on e_stop but achieves a marginally smaller median ΔBest in 15 of 24 cells (p = 0.067). The apparent SVCCA advantage on ΔBest reflects SVCCA collapsing to the full training horizon in the majority of cells.
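For reference, the exact one-sided sign-test p-values in Table A3 can be recomputed from the win and loss counts alone, with ties excluded from the effective n. The sketch below uses only the Python standard library; the function name is hypothetical.

```python
from math import comb

def sign_test_p(wins, losses):
    """Exact one-sided binomial tail P(K >= wins | n, p = 0.5), ties excluded."""
    n = wins + losses  # effective sample size after dropping ties
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Win/loss counts taken from Table A3.
p_slope_regret = sign_test_p(15, 7)    # SES vs. Slope, ΔBest
p_cdsc_epoch = sign_test_p(19, 5)      # SES vs. CDSC, e_stop
p_svcca_epoch = sign_test_p(17, 7)     # SES vs. SVCCA, e_stop
p_patience_epoch = sign_test_p(17, 1)  # SES vs. Patience, e_stop
```

These four calls give values that round to 0.0669, 0.0033, 0.0320, and a value below 0.0001, matching the corresponding rows of the table.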

Appendix G. Aggregation-Rule Ablation on the Development Benchmark

Table A4 and Table A5 report an offline recomputation of the SES stopping decision on the ETTh1 development benchmark (σ = 0, 100 epochs, 10 seeds) using five aggregation rules applied to the same stored per-epoch symbolic descriptors. No neural model was retrained. Table A4 reports aggregate results (n = 30, across all three architectures). Table A5 reports the per-model breakdown (n = 10 each).
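
To make the comparison concrete, the sketch below shows one plausible reading of three of the aggregation rules. It assumes that each active descriptor contributes one rank per epoch and that the rule collapses those ranks into a single score; the exact rank construction, the variance estimate behind the 1/var weights, and all function names are assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, median, pvariance


def mean_rank(ranks: dict) -> float:
    """Mean of the per-descriptor ranks at the current epoch."""
    return mean(ranks.values())


def median_rank(ranks: dict) -> float:
    """Median of the per-descriptor ranks at the current epoch."""
    return median(ranks.values())


def inverse_variance_rank(ranks: dict, history: dict, eps: float = 1e-12) -> float:
    """Weighted mean of ranks with weights 1/var (assumed form).

    history maps each descriptor to its past ranks; volatile descriptors
    receive a small weight. eps guards against a zero variance.
    """
    weights = {name: 1.0 / (pvariance(history[name]) + eps) for name in ranks}
    total = sum(weights.values())
    return sum(weights[name] * ranks[name] for name in ranks) / total


if __name__ == "__main__":
    # Hypothetical ranks for the four active descriptors at one epoch.
    ranks = {"LZ": 1, "hM": 2, "PermEn": 3, "D2": 10}
    # Hypothetical rank histories; D2 is deliberately volatile.
    history = {
        "LZ": [1, 2, 1, 2],
        "hM": [2, 3, 2, 3],
        "PermEn": [3, 4, 3, 4],
        "D2": [2, 10, 4, 12],
    }
    print(mean_rank(ranks), median_rank(ranks), inverse_variance_rank(ranks, history))
```

In this toy example the single volatile descriptor pulls the mean upward, while the median and the inverse-variance score stay close to the ranks of the stable descriptors.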

Table A4. Aggregation-rule ablation: aggregate over all (model, seed) cells (n = 30 per row). med = median; eps_saved = epochs saved; fail_rate = fraction of runs with ΔBest > 0.010 (regret threshold relative to the oracle validation MSE).

Aggregator	med e_stop	IQR e_stop	med ΔBest	IQR ΔBest	med eps_saved	fail_rate
mean-rank	28.0	13.8	0.0024	0.1441	72.0	0.33
median-rank	26.5	14.0	0.0033	0.1821	73.5	0.33
top-50%	24.5	12.5	0.0025	0.3436	75.5	0.33
top-30%	24.5	12.5	0.0025	0.3436	75.5	0.33
weighted (1/var)	22.5	17.2	0.0022	0.2905	77.5	0.33
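
The columns of Table A4 are simple per-row summaries, and the aggregate fail_rate of 0.33 is what pooling the per-model rates of Table A5 would give (10 failing Transformer runs out of 30). The sketch below illustrates this under stated assumptions: the helper names, the inclusive quantile convention used for the IQR, and the toy inputs are illustrative choices, and the horizon of 100 epochs and the 0.010 threshold are taken from the text above.

```python
from statistics import median, quantiles


def summarize(e_stop, delta_best, e_max=100, threshold=0.010):
    """Row-level summary in the style of Table A4 (assumed conventions)."""
    q1, _, q3 = quantiles(e_stop, n=4, method="inclusive")
    return {
        "med_e_stop": median(e_stop),
        "iqr_e_stop": q3 - q1,
        "med_delta_best": median(delta_best),
        # Epochs saved are counted against the full training horizon.
        "med_eps_saved": median([e_max - e for e in e_stop]),
        # A run fails when its regret exceeds the threshold.
        "fail_rate": sum(d > threshold for d in delta_best) / len(delta_best),
    }


def pooled_rate(rates, sizes):
    """Pool per-group failure rates into one rate, weighting by group size."""
    return sum(r * n for r, n in zip(rates, sizes)) / sum(sizes)


if __name__ == "__main__":
    # Toy inputs, not data from the paper.
    print(summarize([20, 24, 28, 32], [0.001, 0.002, 0.02, 0.3]))
    # Per-model fail rates from Table A5 (RNN, BiRNN, Transformer), n = 10 each.
    print(round(pooled_rate([0.0, 0.0, 1.0], [10, 10, 10]), 2))
```

Because epochs saved is an affine function of the stopping epoch, the median epochs saved always equals the horizon minus the median stopping epoch, which matches every row of Table A4.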

Table A5. Aggregation-rule ablation: per-model breakdown (n = 10 per row).

Aggregator	med e_stop	IQR e_stop	med ΔBest	IQR ΔBest	med eps_saved	fail_rate	Model
mean-rank	24.5	8.5	0.0018	0.0006	75.5	0.00	RNN
median-rank	20.5	7.8	0.0026	0.0018	79.5	0.00	RNN
top-50% ‡	22.5	11.0	0.0020	0.0006	77.5	0.00	RNN
top-30% ‡	22.5	11.0	0.0020	0.0006	77.5	0.00	RNN
weighted (1/var)	22.5	19.2	0.0020	0.0014	77.5	0.00	RNN
mean-rank	30.0	17.0	0.0013	0.0020	70.0	0.00	BiRNN
median-rank	31.5	13.8	0.0017	0.0011	68.5	0.00	BiRNN
top-50%	30.0	18.8	0.0014	0.0017	70.0	0.00	BiRNN
top-30%	30.0	18.8	0.0014	0.0017	70.0	0.00	BiRNN
weighted (1/var)	20.0	14.5	0.0015	0.0010	80.0	0.00	BiRNN
mean-rank	31.0	38.0	0.2860	0.2249	69.0	1.00	Transformer
median-rank	30.5	17.2	0.3275	0.2259	69.5	1.00	Transformer
top-50%	21.0	11.5	0.3735	0.1233	79.0	1.00	Transformer
top-30%	21.0	11.5	0.3735	0.1233	79.0	1.00	Transformer
weighted (1/var)	25.5	17.5	0.3086	0.0996	74.5	1.00	Transformer

‡ Note on top-q with K = 4 active descriptors. With K = 4 active descriptors (LZ, h^M, PermEn, D₂; D^f excluded by default), the top-q rule with q ∈ {0.3, 0.5} selects the ⌈qK⌉ = 2 smallest ranks in both cases and therefore reduces to the same statistic (the mean of the two smallest ranks). With a larger active panel (e.g., K = 5 when D^f is included), the two variants become distinct.
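
The collapse of the two top-q variants follows directly from the ceiling in ⌈qK⌉. The short sketch below checks the arithmetic for K = 4 and K = 5; the function name `top_q_rank` and the example rank values are hypothetical, and the rule is assumed to be the mean of the ⌈qK⌉ smallest ranks as stated in the note.

```python
from math import ceil
from statistics import mean


def top_q_rank(ranks, q):
    """Mean of the ceil(q * K) smallest ranks, where K = len(ranks)."""
    m = ceil(q * len(ranks))
    return mean(sorted(ranks)[:m])


if __name__ == "__main__":
    ranks_k4 = [3, 1, 4, 2]     # hypothetical ranks, K = 4
    ranks_k5 = [3, 1, 4, 2, 5]  # hypothetical ranks, K = 5

    # K = 4: ceil(1.2) = ceil(2.0) = 2, so both variants average the same two ranks.
    print(top_q_rank(ranks_k4, 0.3), top_q_rank(ranks_k4, 0.5))

    # K = 5: ceil(1.5) = 2 but ceil(2.5) = 3, so the variants separate.
    print(top_q_rank(ranks_k5, 0.3), top_q_rank(ranks_k5, 0.5))
```

This is why the top-50% and top-30% rows of Tables A4 and A5 are identical.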

Conclusion. The ablation supports median-rank as the SES default: it offers a stable compromise between earliness and regret without introducing data-driven weights. Mean-rank achieves marginally lower ΔBest on the recurrent architectures (Table A5), but at the cost of additional sensitivity to volatile descriptors. The top-q variants reduce to the mean of the two smallest ranks on this descriptor panel and therefore do not provide a distinct operating point. The inverse-variance weighted variant stops most aggressively in aggregate (median e_stop = 22.5 in Table A4), but with the largest dispersion in stopping epoch (IQR = 17.2) and without a corresponding regret advantage. The failure rate of 1.00 for the Transformer holds for every aggregator, so it reflects a property of ETTh1 at this architecture scale rather than a deficiency of any particular aggregation rule.

References

Caruana, R.; Lawrence, S.; Giles, C. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems 13; MIT Press: Cambridge, MA, USA, 2001; pp. 402–408.
Prechelt, L. Early stopping—But when? In Neural Networks: Tricks of the Trade, 2nd ed.; Montavon, G., Orr, G.B., Müller, K.-R., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 53–67.
Raskutti, G.; Wainwright, M.J.; Yu, B. Early stopping and non-parametric regression: An optimal data-dependent stopping rule. J. Mach. Learn. Res. 2014, 15, 335–366.
Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, GA, USA, 16–21 June 2013; Volume 28, pp. 1310–1318.
Kajitsuka, T.; Sato, I. On the optimal memorization capacity of transformers. In Proceedings of the 13th International Conference on Learning Representations (ICLR 2025), Singapore, 24–28 April 2025.
Dana, L.; Pydi, M.S.; Chevaleyre, Y. Memorization in attention-only transformers. In Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS 2025), Phuket, Thailand, 3–5 May 2025; Volume 258, pp. 3133–3141.
Sussillo, D.; Barak, O. Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Comput. 2013, 25, 626–649.
Maheswaranathan, N.; Williams, A.; Golub, M.; Ganguli, S.; Sussillo, D. Universality and individuality in neural dynamics across large populations of recurrent networks. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
Dao, T.; Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 21–27 July 2024.
Kim, C.M.; Chow, C.C. Learning recurrent dynamics in spiking networks. eLife 2018, 7, e37124.
Mastrogiuseppe, F.; Carmona, J.; Machens, C.K. Stochastic activity in low-rank recurrent neural networks. PLoS Comput. Biol. 2025, 21, e1013371.
Rieck, B.; Togninalli, M.; Bock, C.; Moor, M.; Horn, M.; Gumbsch, T.; Borgwardt, K. Neural persistence: A complexity measure for deep neural networks using algebraic topology. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019.
Gutiérrez-Fandiño, A.; Pérez-Fernández, D.; Armengol-Estapé, J.; Villegas, M. Persistent homology captures the generalization of neural networks without a validation set. arXiv 2021, arXiv:2106.00012.
Zhang, B.; Lin, H. Functional loops: Monitoring functional organization of deep neural networks using algebraic topology. Neural Netw. 2024, 174, 106239.
Zia, A.; Khamis, A.; Nichols, J.; Hayder, Z.; Rolland, V.; Petersson, L. Topological deep learning: A review of an emerging paradigm. Artif. Intell. Rev. 2024, 57, 77.
Damrich, S.; Berens, P.; Kobak, D. Persistent homology for high-dimensional data based on spectral methods. In Proceedings of the Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024.
Singh, G.; Mémoli, F.; Carlsson, G. Topological methods for the analysis of high dimensional data sets and 3D object recognition. In Eurographics Symposium on Point-Based Graphics; Botsch, M., Pajarola, R., Chen, B., Zwicker, M., Eds.; The Eurographics Association: Geneva, Switzerland, 2007; pp. 91–100.
Madukpe, V.N.; Ugoala, B.C.; Zulkepli, N.F.S. A comprehensive review of the Mapper algorithm, a topological data analysis technique, and its applications across various fields (2007–2025). arXiv 2025, arXiv:2504.09042.
Haşegan, D.; Patel, S.; Sahoo, A.; Saggar, M. Deconstructing the Mapper algorithm to extract richer topological and temporal features from functional neuroimaging data. Netw. Neurosci. 2024, 8, 1355–1382.
Simpson, S.G. Symbolic dynamics: Entropy = dimension = complexity. Theory Comput. Syst. 2015, 56, 527–543.
Hirata, Y.; Amigó, J.M. A review of symbolic dynamics and symbolic reconstruction of dynamical systems. Chaos 2023, 33, 052101.
Ren, Q.; Zhang, J.; Xu, Y.; Wang, Y.; Yu, Y.; Zhang, Q. Towards the dynamics of a DNN learning symbolic interactions. In Proceedings of the Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024.
Das, J.; Bhaumik, B.; De, S.; Mitra, A. Physics-informed neural network with symbolic regression for deriving analytical approximate solutions to nonlinear partial differential equations. Neural Comput. Appl. 2025, 37, 20205–20240.
Raghu, M.; Gilmer, J.; Yosinski, J.; Sohl-Dickstein, J. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Proceedings of the Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
Ferro, M.V.; Mosquera, Y.D.; Pena, F.J.R.; Bilbao, V.M.D. Early stopping by correlating online indicators in neural networks. Neural Netw. 2023, 159, 109–124.
Nakkiran, P.; Kaplun, G.; Bansal, Y.; Yang, T.; Barak, B.; Sutskever, I. Deep double descent: Where bigger models and more data hurt. arXiv 2019, arXiv:1912.02292.
Xia, X.; Liu, T.; Han, B.; Gong, C.; Wang, N.; Ge, Z.; Chang, Y. Robust early-learning: Hindering the memorization of noisy labels. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, 3–7 May 2021.
Hensel, F.; Moor, M.; Rieck, B. A survey of topological machine learning methods. Front. Artif. Intell. 2021, 4, 681108.
Naitzat, G.; Zhitnikov, A.; Lim, L.-H. Topology of deep neural networks. J. Mach. Learn. Res. 2020, 21, 1–40.
Mediano, P.A.M.; Rosas, F.E.; Bor, D.; Seth, A.K.; Barrett, A.B. Spectrally and temporally resolved estimation of neural signal diversity. bioRxiv 2023.
Hussein, B.M.; Shareef, M.S. An empirical study on the correlation between early stopping patience and epochs in deep learning. ITM Web Conf. 2024, 64, 01003.
Hu, T.; Lei, Y. Early stopping for iterative regularization with general loss functions. J. Mach. Learn. Res. 2022, 23, 1–36.
Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 1977, 23, 337–343.
Welch, T.A. A technique for high-performance data compression. Computer 1984, 17, 8–19.
Höhn, C.; Hahn, M.A.; Lendner, J.D.; Hoedlmoser, K. Spectral slope and Lempel–Ziv complexity as robust markers of brain states during sleep and wakefulness. eNeuro 2024, 11.
Dingle, K.; Hamzi, B.; Hutter, M.; Owhadi, H. Retrodicting chaotic systems: An algorithmic information theory approach. arXiv 2025, arXiv:2507.04780.
Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning requires rethinking generalization. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017.
Bandt, C.; Pompe, B. Permutation entropy: A natural complexity measure for time series. Phys. Rev. Lett. 2002, 88, 174102.
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115.
Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C. Time Series Analysis: Forecasting and Control, 3rd ed.; Prentice Hall: Englewood Cliffs, NJ, USA, 1994.
McNally, S.; Roche, J.; Caton, S. Predicting the price of Bitcoin using machine learning. In Proceedings of the 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2018); IEEE: Piscataway, NJ, USA, 2018; pp. 339–343.
Koelstra, S.; Mühl, C.; Soleymani, M.; Lee, J.-S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. DEAP: A database for emotion analysis using physiological signals. IEEE Trans. Affect. Comput. 2012, 3, 18–31.
Brunton, S.L.; Proctor, J.L.; Kutz, J.N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA 2016, 113, 3932–3937.
Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623.
Miseta, T.; Fodor, A.; Vathy-Fogarassy, Á. Surpassing early stopping: A novel correlation-based stopping criterion for neural networks. Neurocomputing 2024, 567, 127028.

Figure 1. Mapper-style graph over the hidden-state space. Red and green dots show individual projected hidden states; blue circles denote selected exemplar states; black nodes connected by black edges form the Mapper graph built from these exemplars. The purple lines indicate the induced partition of the plane (Voronoi-like regions) around the graph nodes.

Figure 2. Conceptual diagram of the Symbolic Early-Stopping (SES) pipeline.

Figure 3. AirPassengers example fragment and the corresponding Mapper graph of hidden states (2D PCA lens; nodes are clusters of states; edges indicate cluster overlaps across neighboring cover regions).

Figure 4. Dispersion of stopping epochs across N = 10 random seeds for individual symbolic-dynamics descriptors and their combinations on AirPassengers (boxplots show the median and IQR). The SES ensemble yields a lower dispersion than most of the single descriptors.

Figure 5. Evolution of symbolic-dynamics metrics across epochs for AirPassengers (RNN, BiRNN, and Transformer; values are normalized and averaged in the same display regime as used in the protocol).

Figure 6. EEG data. EEG example fragment (temporal evolution of PCA1 across samples) (a) and the corresponding Mapper graph of hidden states (3D PCA lens) (b). Node color encodes the mean continuous EEG label.

Figure 7. Dispersion of stopping epochs across N = 10 random seeds for individual symbolic-dynamics descriptors and their combinations on EEG (boxplots show median and IQR).

Figure 8. Evolution of symbolic-dynamics metrics across epochs for EEG (RNN, BiRNN, and Transformer; same normalization and display convention as in the protocol).

Figure 9. ETT data. ETTh2(OT) example fragment and the corresponding Mapper graph of hidden states (2D PCA lens). Thick green edges indicate dominant transitions along the temporal trajectory.

Figure 10. Dispersion of stopping epochs across N = 10 random seeds for individual symbolic-dynamics descriptors and their combinations on ETTm1 (boxplots show median and IQR).

Figure 11. Evolution of symbolic-dynamics metrics across epochs for ETTm1 (RNN, BiRNN, and Transformer; same normalization and display convention as in the protocol).

Figure 12. Lorenz data. Example fragment of the Lorenz trajectory and the corresponding Mapper graph of hidden states (3D PCA lens). The multi-region connectivity of the graph reflects the attractor structure and transitions between latent regimes.

Figure 13. Dispersion of stopping epochs across N = 10 random seeds for individual symbolic-dynamics descriptors and their combinations on the Lorenz trajectory (boxplots show median and IQR).

Figure 14. Evolution of symbolic-dynamics metrics across epochs for the Lorenz trajectory (RNN, BiRNN, and Transformer; same normalization and display convention as in the protocol).

Table 1. Runtime profiling on ETTh1 (noise-free, 10 seeds, E_max = 100). Base epoch = training + validation time. Repr epoch = training + validation + representation extraction time. Overhead is the relative increase of the representation-aware epoch cost over the base epoch cost. For each stopping criterion, entries are reported as total wall-clock time in seconds/time saved relative to full training in %/ΔBest. All values are medians across seeds; base, repr, and full columns are reported as median [IQR].

Model	Base epoch, s	Repr epoch, s	Overhead, %	Full Training, s	Patience (s/%/ΔBest)	Slope (s/%/ΔBest)	CDSC (s/%/ΔBest)	SES (s/%/ΔBest)	SVCCA (s/%/ΔBest)
RNN	3.37 [0.18]	3.51 [0.18]	3.95 [0.04]	337.4 [17.6]	126.3/64.0/0.0006	111.9/65.5/0.0011	14.2/95.8/0.0115	116.4/64.9/0.0006	351.4/−4.1/0.0035
BiRNN	5.64 [0.03]	5.87 [0.04]	4.07 [0.02]	563.8 [3.4]	219.5/61.0/0.0013	158.0/72.0/0.0010	23.6/95.8/0.0085	177.9/69.9/0.0007	588.3/−4.4/0.0062
Transformer	33.25 [0.46]	33.57 [0.46]	0.98 [0.01]	3324.7 [45.5]	400.8/88.0/0.3662	400.0/88.0/0.3802	134.5/96.0/0.0682	1195.7/64.5/0.2052	1855.4/44.4/0.2120
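
The derived columns of Table 1 follow from two ratios defined in the caption. The sketch below applies them to the reported medians; the function names are illustrative. Because the table reports medians across seeds, a ratio of medians need not equal the reported median of per-seed ratios exactly, so small differences from the tabulated values are expected.

```python
def overhead_pct(base_epoch_s: float, repr_epoch_s: float) -> float:
    """Relative increase of the representation-aware epoch cost over the base cost, in %."""
    return 100.0 * (repr_epoch_s - base_epoch_s) / base_epoch_s


def time_saved_pct(total_s: float, full_training_s: float) -> float:
    """Wall-clock time saved relative to full training, in %.

    Negative values mean the run cost more than full base training,
    e.g. when a criterion never stops early but still pays the extraction cost.
    """
    return 100.0 * (1.0 - total_s / full_training_s)


if __name__ == "__main__":
    # BiRNN row: base 5.64 s and repr 5.87 s per epoch (reported overhead 4.07%).
    print(f"BiRNN overhead: {overhead_pct(5.64, 5.87):.2f}%")
    # RNN row, SVCCA: 351.4 s total against 337.4 s full training (reported -4.1%).
    print(f"RNN SVCCA saved: {time_saved_pct(351.4, 337.4):.1f}%")
    # Transformer row, SES: 1195.7 s against 3324.7 s (reported 64.5%).
    print(f"Transformer SES saved: {time_saved_pct(1195.7, 3324.7):.1f}%")
```

The negative SVCCA entries for RNN and BiRNN are consistent with runs that reach the full horizon while paying the per-epoch representation-extraction overhead.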

Table 2. Mapper hyperparameters for SES on ETTh1 (RNN, σ = 0.0, 10 runs).

cfg_id (short)	bins	overlap	local_k	merge_eps	win_rate ↑	median_rel_regret ↓	median_saved_frac ↑	score ↓
9392dc	8	0.30	10	0.50	0.200	0.467	0.510	0.990
dc1b40	8	0.20	10	0.00	0.133	0.524	0.425	1.084
2210ff	8	0.40	10	1.00	0.100	0.986	0.000	1.297
09256c	6	0.40	5	0.75	0.067	1.016	0.000	1.337

Table 3. Transferability of SES/Mapper across architectures on ETTh1.

Model	N	SES e_stop (Median [IQR])	SES ΔBest (Median [IQR])	SES epochs_saved (Median [IQR])	Best SOTA	SOTA e_stop (Median [IQR])	SOTA ΔBest (Median [IQR])	SOTA epochs_saved (Median [IQR])
RNN	10	18 [1.00]	0.002912 [0.003918]	82 [1.00]	Patience	36 [6.75]	0.000576 [0.000771]	64 [6.75]
BiRNN	10	17 [1.50]	0.002172 [0.001218]	83 [1.50]	Slope	28 [11.25]	0.001017 [0.001065]	72 [11.25]
Transformer	10	15 [0.00]	0.000207 [0.000331]	85 [0.00]	Patience	15 [1.50]	0.000221 [0.000323]	85 [1.50]
