1. Introduction
As AI-driven scientific research advances and automated analysis of open healthcare data becomes increasingly prevalent, large-scale physiological sensing networks are emerging as essential components of health monitoring, intelligent systems, and connected digital infrastructures. Wearable sensors, cloud-computing platforms, and distributed processing pipelines now enable the continuous acquisition and real-time analysis of cardiovascular signals across wide populations. As the scale and complexity of these networked systems expand, robust signal-processing methodologies capable of reliably extracting physiological information from heterogeneous and often noisy data streams have become more critical than ever. Heart rate variability (HRV), derived from the beat-to-beat (RR) interval time series, is widely recognized as a key biomarker of autonomic regulation and overall physiological state. Among the various spectral estimation methods, autoregressive (AR) modeling is one of the most established approaches for computing the power spectral density (PSD) of HRV. AR models provide high spectral resolution even for short data segments, an indispensable advantage for network-based health-monitoring applications where data may be fragmented, intermittently transmitted, or uneven in quality. A fundamental requirement in AR spectral analysis is the appropriate selection of the model order (p), which directly determines spectral smoothness and peak structure and affects model stability, spectral fidelity, and computational efficiency in large-scale distributed environments. Information criteria such as the Akaike Information Criterion (AIC) are frequently employed for automatic order selection in large-scale or automated HRV pipelines.
The AIC is one of the most commonly used methods for AR order selection because it balances model fit against complexity. However, despite its theoretical grounding, the AIC is known to exhibit instability when the candidate search range becomes excessively large, leading to overfitting, inconsistent model selection, and spurious spectral peaks [1,2,3]. This issue is particularly problematic in automated HRV analytics deployed on cloud platforms or embedded devices, where decisions must remain robust despite fluctuations in noise levels, data length, and sampling consistency. Furthermore, AIC-based order selection depends not only on the data but also on the predefined search range of candidate orders (pmax). While this dependency is theoretically acknowledged, its practical impact on HRV spectral estimation has not been systematically evaluated using large open datasets. These limitations of the AIC reflect a deeper challenge: the long-standing dilemma surrounding the “curse of dimensionality” and the often-arbitrary nature of dimensionality reduction in data analysis. Conventional approaches typically assume that reducing the number of model variables is inherently desirable, yet such reduction is ultimately imposed for human or computational convenience rather than dictated by the intrinsic structure of the data. Information criteria such as the AIC provide numerical guidance regarding model dimensionality, but their thresholds remain fundamentally heuristic. Furthermore, traditional model selection frameworks consider only the inclusion or exclusion of variables, and do not account for more nuanced possibilities, such as identifying optimal variable weightings or continuous transformations that preserve all dimensions while equalizing information density across axes. Recent perspectives in manifold learning and information-theoretic modeling argue that the curse of dimensionality is not determined solely by the number of dimensions but by the distribution of information density within the data space. High-dimensional datasets may in practice lie on low-dimensional manifolds where local neighborhoods admit approximately orthogonal coordinates, making the effective dimensionality much smaller than the nominal dimensionality. Conversely, reducing dimensionality without respecting the underlying information density can introduce distortion or eliminate variables that exert subtle but meaningful influences on the system.
While several advanced techniques, such as independent component analysis (ICA), Infomax-based methods, and curvature-based manifold learning, attempt to address these issues, they bring their own challenges, including sensitivity to non-Gaussianity, instability in hyperparameter tuning, or high computational complexity. Motivated by these considerations, the present study systematically evaluates the robustness of AIC-based AR order selection using the PhysioNet healthy subject RR interval database (N = 1257). Specifically, we investigate how the AIC behaves when the maximum permissible AR order is intentionally set far beyond common recommendations (e.g., pmax = 50 instead of pmax = 20). By examining this extreme setting, our goal is to better understand the interplay between information-theoretic model selection, dimensionality, and overfitting in large-scale HRV processing pipelines operating within the future internet ecosystem. Based on this objective, we hypothesize that:
H1. The selected order increases when pmax is expanded.
H2. The difference between the best and second-best AIC values decreases under large pmax.
H3. Excessively high orders increase spectral peak fragmentation.
This study provides a quantitative robustness analysis of information-criterion-based AR order selection under controlled expansion of the search space, aiming to establish more reliable guidelines for automated physiological signal processing.
2. Materials and Methods
2.1. Data Sources
This study employed RR interval (RRI) time series from two publicly available databases hosted on PhysioNet, a large-scale open-access platform providing physiological signal datasets, software, and related documentation for scientific research.
- (1)
Healthy Subjects RR Interval Database
This database provides normal sinus rhythm ECG recordings obtained from healthy volunteers, with recording durations ranging from 5 min to 24 h. The RR interval time series extracted from these ECGs are widely used as a healthy control cohort in heart rate variability (HRV) research. Because the subjects exhibit normal cardiac rhythms without pathology, the dataset is particularly well suited for methodological evaluation and baseline variability studies.
In the present study, artifact detection and preprocessing of the RR interval (RRI) time series were conducted according to commonly accepted standards in heart rate variability (HRV) analysis and biological signal processing. First, a physiological range filter was applied to exclude biologically implausible RRI values. Intervals shorter than 300 ms or longer than 1500 ms were removed as outliers. These thresholds correspond approximately to heart rates outside the range of 40–200 beats per minute and are widely used to eliminate obvious detection errors. Second, a percentage-based filter was implemented to identify abrupt beat-to-beat changes. RR intervals differing by more than 20% from the immediately preceding interval were classified as artifacts. Sudden changes exceeding this threshold are unlikely under normal sinus rhythm conditions and are therefore considered indicative of measurement noise or ectopic events. No additional statistical deviation filtering (e.g., ±3 standard deviations from a moving average) was applied. Furthermore, detected artifacts were not corrected using interpolation methods such as linear or spline interpolation; instead, only the above two standard exclusion criteria were employed to ensure methodological consistency and to avoid introducing smoothing-related bias into the AR spectral analysis. This preprocessing approach aligns with conventional HRV methodological practice and was chosen to maintain transparency and reproducibility in the evaluation of AIC-based AR order selection robustness.
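The two exclusion criteria above can be sketched as a short filtering routine. This is a minimal illustration only: the function name, the vectorized comparison against the immediately preceding range-filtered interval, and the 1500 ms default upper bound (chosen to match the stated 40 bpm limit) are our assumptions.

```python
import numpy as np

def clean_rri(rri_ms, lo=300.0, hi=1500.0, pct=0.20):
    """Two-step artifact exclusion for an RRI series in milliseconds.

    1. Physiological range filter: keep lo <= RRI <= hi
       (defaults roughly bracket 40-200 bpm).
    2. Percentage filter: drop any interval differing by more than
       `pct` (20%) from the immediately preceding interval.
    Artifacts are excluded, never interpolated.
    """
    rri = np.asarray(rri_ms, dtype=float)
    rri = rri[(rri >= lo) & (rri <= hi)]                # step 1: range filter
    keep = np.ones(rri.size, dtype=bool)
    keep[1:] = np.abs(np.diff(rri)) / rri[:-1] <= pct   # step 2: 20% rule
    return rri[keep]
```

For example, `clean_rri([800, 810, 805, 2000, 790, 400, 795])` first drops the 2000 ms outlier and then removes the 400 ms ectopic-like beat and its successor via the 20% rule.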
- (2)
Autonomic Aging Database
The Autonomic Aging Database was designed to investigate age-related differences in autonomic nervous system regulation. It contains approximately 120 min of supine resting ECG recordings collected from two groups of healthy individuals:
Young adults (approximately in their 20s)
Older adults (approximately in their 70s)
From these ECG records, RRI time series were derived to enable quantitative HRV analysis. The database has been widely used to characterize age-related reductions in parasympathetic activity and changes in autonomic balance.
Across both datasets, the RRI time series represent the interval between successive R-wave peaks in the ECG and provide the fundamental temporal structure for HRV assessment.
For the Healthy Subjects database, the original recordings span approximately 24 h, but in this study the first 1000 heartbeats were extracted and used under consistent conditions. For the Autonomic Aging Database, the entire recording, approximately 15 min in length, was used. In both cases, one file was treated as one subject (record).
- (3)
Stationarity Assessment
When long-term RR interval recordings such as those contained in the PhysioNet Healthy Subjects RR Interval Database and the Autonomic Aging Database are evaluated using formal stationarity tests, such as the augmented Dickey–Fuller (ADF) test, the outcome strongly depends on the segmentation strategy. If the entire long-duration recording (e.g., 24 h) is analyzed as a single time series, the data are typically classified as nonstationary (i.e., a unit root is present). This is primarily due to circadian variation: baseline heart rate and variance differ substantially between sleep and wakefulness, violating the assumption of constant mean and variance. In addition, slow trends reflecting large-scale autonomic modulation increase the probability that the ADF test identifies a unit root.
In contrast, when short-term segments (approximately 5 min) are analyzed, a substantially larger proportion of segments are statistically classified as stationary. Over short resting intervals, baseline fluctuations are smaller, and the null hypothesis of a unit root can often be rejected. This is consistent with conventional HRV frequency-domain analysis, where short-term (5-min) stationarity is typically assumed.
Database-specific tendencies are also observed. In the Healthy Subjects database, younger individuals often exhibit pronounced respiratory sinus arrhythmia (RSA). These oscillatory dynamics tend to show mean-reverting behavior, which can lead to segments being classified as stationary in short-term analysis, although abrupt state transitions (e.g., arousal) may still produce transient nonstationarity. In the Autonomic Aging Database, age-related reductions in autonomic flexibility result in smaller RRI fluctuations. Statistically, such flatter dynamics are more likely to be classified as stationary by ADF testing, although this reflects reduced physiological complexity rather than enhanced stability.
In the present study, segments were not excluded based on formal stationarity criteria. This decision was made for methodological and conceptual reasons. While strict stationarity screening is traditionally recommended when applying classical frequency-domain techniques (FFT-based LF and HF power estimation), contemporary analytical approaches in nonlinear dynamics and machine learning often intentionally retain nonstationary structure. In physiological signals, nonstationarity itself may encode meaningful biological information, such as stress responses or dynamic autonomic adaptation. Excessively strict exclusion criteria may therefore remove physiologically relevant variability, particularly in publicly available datasets such as those from PhysioNet that reflect natural, dynamic conditions.
For classical spectral estimation, it is well recognized that strong trends can induce spectral leakage, artificially increasing low-frequency power. In such cases, exclusion or explicit detrending procedures may be appropriate. However, because the primary objective of the present study was to evaluate the robustness of AIC-based autoregressive (AR) order selection under realistic conditions—including naturally occurring variability—segments were retained without ADF-based exclusion. To mitigate potential spectral distortion in analyses involving Fourier-based methods, window functions (Hanning window) were applied where appropriate to reduce edge effects and minimize leakage due to mild nonstationarity within segments. This approach reflects a balanced methodological stance: acknowledging classical stationarity assumptions in frequency-domain HRV analysis while recognizing that modern nonlinear and data-driven frameworks increasingly treat nonstationarity not as a nuisance to be eliminated, but as a potentially informative property of biological signals.
In the Healthy Subjects RR Interval Database, the first 1000 consecutive heartbeats were extracted and used for analysis. This segment corresponds approximately to a 5-min resting period under stable conditions and is consistent with standard short-term HRV analysis protocols. In the Autonomic Aging Database (Dataset B), the approximately 15-min continuous recordings were divided into consecutive, non-overlapping 5-min segments to ensure comparability with conventional short-term HRV methodology.
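The segmentation into consecutive, non-overlapping 5-min windows can be sketched as follows. This is illustrative only; the function name and the time-based cut at 300 s of cumulative RR time are our assumptions.

```python
import numpy as np

def five_min_segments(rri_ms, seg_len_ms=300_000):
    """Split an RRI series (ms) into consecutive, non-overlapping
    ~5-min segments based on cumulative elapsed time; any incomplete
    trailing segment is discarded."""
    rri = np.asarray(rri_ms, dtype=float)
    t = np.cumsum(rri)                    # elapsed time at each beat
    segments, start, base = [], 0, 0.0
    for i, ti in enumerate(t):
        if ti - base >= seg_len_ms:       # current window reached 5 min
            segments.append(rri[start:i + 1])
            start, base = i + 1, ti
    return segments
```

A steady 60 bpm series of 1000 beats (1000 ms each), for instance, yields three complete 5-min segments of 300 beats, with the 100-beat remainder dropped.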
In addition to formal stationarity considerations, we also examined results from nonlinear analysis using Detrended Fluctuation Analysis (DFA) as an alternative perspective on dynamic structure. DFA quantifies the scaling exponent (α), which characterizes long-range correlation properties in RR interval time series. When α approaches 1.5, the signal exhibits Brownian noise–like behavior, indicating strong nonstationarity and random-walk characteristics. In contrast, values between approximately 0.5 and 1.0 are generally interpreted as reflecting relatively stable scaling behavior, with α ≈ 1.0 corresponding to 1/f dynamics and preserved physiological complexity.
The interpretation of the scaling exponent is particularly suitable for evaluating the complexity of biological systems:
* α ≈ 1.0: 1/f fluctuation, typically observed in healthy physiological systems, reflecting balanced correlations and adaptive complexity.
* α ≈ 0.5: White noise-like behavior, indicating loss of correlation and disorganized dynamics, sometimes observed in pathological conditions such as heart failure.
* α ≈ 1.5: Brown noise-like behavior, reflecting strong nonstationarity and a random-walk-like structure.
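The scaling exponent α referenced above can be estimated with a compact DFA-1 implementation following the standard algorithm (integrate the mean-centered series, window it, linearly detrend each window, and fit log F(s) against log s). The function name and the logarithmic scale grid are our choices.

```python
import numpy as np

def dfa_alpha(rri, scales=None):
    """Estimate the DFA-1 scaling exponent alpha of a series."""
    x = np.asarray(rri, dtype=float)
    y = np.cumsum(x - x.mean())                     # integrated profile
    if scales is None:                              # log-spaced window sizes
        scales = np.unique(np.logspace(
            np.log10(4), np.log10(x.size // 4), 12).astype(int))
    F = []
    for s in scales:
        n_win = y.size // s
        t = np.arange(s)
        rms = []
        for w in range(n_win):
            seg = y[w * s:(w + 1) * s]
            trend = np.polyval(np.polyfit(t, seg, 1), t)   # linear detrend
            rms.append(np.sqrt(np.mean((seg - trend) ** 2)))
        F.append(np.mean(rms))                      # fluctuation F(s)
    return float(np.polyfit(np.log(scales), np.log(F), 1)[0])
```

Consistent with the interpretation above, white noise yields α near 0.5 and a random walk (Brownian noise) yields α near 1.5, while healthy RR series typically fall near 1.0.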
Within datasets available through PhysioNet, including heart failure (CHF), atrial fibrillation (AF), and aging-related databases, DFA has become an established and widely accepted tool for distinguishing healthy from pathological cardiac dynamics. Since its introduction in the 1990s, DFA-based scaling analysis has been regarded as a gold-standard nonlinear approach in HRV research.
Representative studies include the following. First, Peng C-K et al. (1995) [4] analyzed normal sinus rhythm (NSR) and congestive heart failure (CHF) datasets that later formed the foundation of PhysioNet databases. They demonstrated that healthy hearts exhibit scaling exponents close to α ≈ 1.0, whereas CHF patients show significant deviation from this value, reflecting a breakdown of fractal organization and loss of physiological complexity. Second, Ho K.K. et al. (1997) [5] showed that a reduction in the short-term scaling exponent (α1) is a stronger predictor of mortality risk in heart failure patients than conventional time-domain indices such as SDNN. Lower α1 values were associated with impaired autonomic regulation and increased risk of sudden cardiac death. Third, Iyengar N. et al. (1996) [6] examined age-related alterations in fractal scaling properties. Comparing younger and elderly groups, they reported a decline in α1 with aging, indicating reduced fractal complexity of heartbeat dynamics. This finding suggests that aging, similar to pathological conditions, is associated with loss of physiological complexity. Because the datasets used in the present study include healthy and older individuals, we referred particularly to findings from the NSR–CHF comparison and aging-related investigations.
2.2. Data Processing Environment
All analyses were performed using Python 3.12. The following scientific libraries were used:
wfdb for PhysioNet data access;
NumPy for numerical computation;
pandas for data handling;
statsmodels for autoregressive (AR) model estimation;
matplotlib for visualization;
tqdm for progress monitoring.
The PhysioNet datasets were obtained using automated download scripts executed via wget. RRI signals were read from the downloaded CSV files, and missing values were removed prior to analysis.
Bootstrap Stability Procedure: To evaluate the robustness of the analytical results against small fluctuations in the data, a bootstrap stability procedure was implemented. The bootstrap stability procedure refers to a resampling-based framework used to assess how sensitive model outcomes and derived features are to minor variations in the dataset. In statistical modeling and machine learning, this process is widely regarded as an essential step for ensuring reliability and reproducibility.
In the present study, bootstrap resampling was performed at the level of the RR interval (RRI) sequence. For each eligible 5-min segment, bootstrap samples were generated using sampling with replacement (resampling strategy: nonparametric bootstrap). Each resampled segment had the same length as the original 5-min segment to preserve comparability of spectral and autoregressive (AR) modeling conditions.
A total of N = 1000 bootstrap iterations were conducted for each segment. For every bootstrap sample, the complete analytical pipeline was repeated, including AR model estimation and model order selection based on the Akaike Information Criterion (AIC).
The stability assessment consisted of the following steps:
Sampling: From the original RRI segment, 1000 bootstrap samples were generated by random sampling with replacement.
Model Construction: For each bootstrap sample, AR model estimation and order selection were performed under identical settings.
Aggregation of Metrics:
Model order selection frequency: the proportion of bootstrap trials in which a specific AR order was selected by the AIC.
Parameter distribution: the empirical distribution of estimated AR coefficients and related model parameters (details regarding root location relative to the unit circle are described in Section 2.6).
Quantitative Definition of Robustness: Robustness was operationally defined using the variance of the selected AR order across bootstrap samples. A smaller variance indicates higher stability of model order determination. Additionally, a high selection frequency for a particular order suggests structural robustness of the underlying dynamics. This procedure is important in physiological signal analysis. Biological signals obtained from publicly available datasets such as those hosted on PhysioNet are inherently influenced by inter-individual variability, measurement noise, and subtle state transitions. Bootstrap evaluation helps prevent overfitting by distinguishing genuinely stable modeling outcomes from results that may occur by chance in a single dataset realization.
2.3. Autoregressive Modeling and AIC-Based Order Selection
To evaluate the robustness of the Akaike Information Criterion (AIC) for AR model order selection, we estimated the optimal AR order p for each subject under two different constraints on the maximum allowable lag:
Condition 1: maximum lag = 20;
Condition 2: maximum lag = 50.
For a given RRI time series, the AR model of order p was fitted using the AutoReg implementation from statsmodels. For each candidate order p = 1,2,…,max_lag, the corresponding AIC value was computed, and the order yielding the minimum AIC was selected as optimal.
A custom Python function (find_best_ar_order) automated this procedure by iteratively fitting AR models across candidate orders and tracking the minimum AIC. Subjects for which model fitting failed due to numerical instability were skipped.
Model-selection robustness was quantified using three metrics:
- (1)
ΔAIC = AIC2 − AIC1 (difference between best and second-best models);
- (2)
Boundary selection rate (frequency of popt = pmax);
- (3)
Bootstrap stability index (percentage of identical order selection across 100 resamples).
The implementation used, AutoReg from the statsmodels library, estimates parameters by conditional maximum likelihood (CML), which for AR models is mathematically equivalent to ordinary least squares (OLS) and is consistent with the likelihood-based definition of the AIC.
2.4. Comparative Analysis of AIC-Selected Orders
For each subject, AIC-optimal AR orders were independently estimated for max_lag = 20 and max_lag = 50. To compare the distributional behavior of the AIC under these two conditions, histograms were generated:
Histogram 1: distribution of AIC-selected orders with max_lag = 20;
Histogram 2: distribution of AIC-selected orders with max_lag = 50.
This comparison highlights how expanding the candidate order range influences AIC-driven model selection, particularly with respect to overfitting and the tendency to capture high-frequency noise or nonstationary fluctuations.
2.5. AIC-Based Model Selection Procedure
The Akaike Information Criterion (AIC) was used to determine the optimal order p of the autoregressive (AR) models. The AIC is defined as (1):
AIC = −2 ln(L) + 2k, (1)
where:
L: maximum likelihood of the fitted model, representing how well the model explains the observed data.
ln(L): natural logarithm of the maximum likelihood.
k: number of free parameters in the model.
For an AR(p) model, k typically increases proportionally with p (e.g., k = p + 1 or k = p + 2, depending on whether an intercept and the noise variance are included).
Minimizing the AIC targets the prediction error for unknown data. The downside is that, when the number of parameters is large compared to the number of samples, the criterion tends to select an overly complex model (overfitting).
The AIC balances two opposing effects. Goodness of fit: increasing the order p improves the fit, increasing L and reducing the term −2 ln(L). Model complexity penalty: higher orders introduce more parameters, increasing the penalty term +2k. Model order selection using the AIC therefore involves computing the AIC value for each candidate order (2):
AIC(p) = −2 ln(L_p) + 2k_p, p = 1, 2, …, pmax, (2)
fitting the corresponding AR model, and selecting the order with the minimum AIC (3):
popt = arg min_p AIC(p). (3)
This procedure operationalizes the trade-off between improved likelihood and increased model complexity, allowing a statistically principled determination of AR model order; robustness was quantified using the three metrics defined in Section 2.3 (ΔAIC, boundary selection rate, and bootstrap stability index).
The AR-based power spectral density was computed as:
P(f) = σ²Δt / |1 − Σ_{k=1}^{p} a_k exp(−i2πf kΔt)|²,
where σ² denotes the residual variance, a_k are the AR coefficients, and Δt is the sampling interval. The spectrum was evaluated over 0–0.5 Hz using 1024 frequency points.
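The spectrum evaluation can be sketched directly from the coefficients and residual variance. This is illustrative; a normalized frequency axis with Δt = 1 is assumed, which matches the 0–0.5 grid for a series resampled at 1 Hz.

```python
import numpy as np

def ar_psd(a, sigma2, n_freq=1024):
    """Evaluate the AR power spectral density
        P(f) = sigma2 / |1 - sum_{k=1..p} a_k exp(-i 2 pi f k)|^2
    on n_freq points over 0-0.5 (normalized frequency, Delta-t = 1)."""
    a = np.asarray(a, dtype=float)
    f = np.linspace(0.0, 0.5, n_freq)
    k = np.arange(1, a.size + 1)
    denom = 1.0 - np.exp(-2j * np.pi * np.outer(f, k)) @ a
    return f, sigma2 / np.abs(denom) ** 2
```

For example, an AR(1) model with a positive coefficient (`ar_psd([0.9], 1.0)`) concentrates power at low frequencies, while a zero coefficient yields a flat (white) spectrum equal to σ².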
In this study, the BIC (Bayesian Information Criterion), AICc (corrected AIC), and FPE (final prediction error) were calculated for comparison with the AIC. In the robustness assessment, the AIC captures spectral details but, when the search range (pmax) is wide, tends to pick up noise and overfit (H1, H3); the BIC, by contrast, makes more conservative (lower-order) selections and is more robust. The calculation formulas are as follows:
BIC = −2 ln(L) + k ln(n),
where n is the sample size. Because the penalty term is k ln(n) rather than 2k, the “complex model” is restricted more severely as the amount of data grows. The BIC is suited to identifying the “true model” and can be used as a comparison when the AIC-selected order becomes too large.
AICc = AIC + 2k(k + 1)/(n − k − 1)
corrects the bias of the AIC when the sample size n is small. It is appropriate when n is small or when the order p of the AR model is non-negligible relative to n.
FPE = σ²(n + p + 1)/(n − p − 1),
where σ² denotes the residual variance. The FPE is one of the earliest order-selection indices proposed for AR models; when n is large, minimizing the FPE is asymptotically equivalent to minimizing the AIC.
2.6. AR Model Stability Verification
After selecting the AR model order using information criteria (AIC, BIC, AICc, FPE), model stability was formally verified using a standard time-series approach based on the roots of the characteristic polynomial.
An AR(p) model is defined as:
x_t = c + ϕ_1 x_{t−1} + ϕ_2 x_{t−2} + … + ϕ_p x_{t−p} + ε_t,
where ϕ_i are the AR coefficients and ε_t is white noise.
The corresponding characteristic equation is:
1 − ϕ_1 z − ϕ_2 z² − … − ϕ_p z^p = 0.
Stability (and stationarity) of the AR model is guaranteed if all roots z of this equation satisfy the condition that their absolute values are greater than 1. Equivalently, when expressed in reciprocal polynomial form, the model is stable if all roots lie strictly inside the unit circle (i.e., root modulus <1 in the alternative representation).
If all roots of the characteristic equation lie outside the unit circle in the complex plane (equivalently, all roots of the reciprocal polynomial lie inside it), the AR process is stable and causal, meaning that the influence of past shocks decays and the time series remains bounded over time. This condition is essential to ensure that the estimated power spectral density (PSD) is mathematically valid and physiologically interpretable.
Implementation Procedure: The stability verification was implemented in Python using NumPy. After estimating the AR coefficients ϕ_1, ϕ_2, …, ϕ_p, the characteristic polynomial coefficients were constructed, and its roots were computed numerically with NumPy (Python 3.12).
The roots were visually inspected by plotting them in the complex plane together with the unit circle (radius = 1). A model was considered stable only if all roots were strictly located within the unit circle. After AR order selection via the AIC (or other criteria), the estimated coefficients were used to compute the characteristic roots, and only models satisfying the stability condition were retained. This procedure ensures both causality and stationarity of the fitted AR model.
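The root-based stability check can be sketched as follows. The function name is ours; the convention matches the text, requiring all roots of 1 − ϕ_1 z − … − ϕ_p z^p to lie outside the unit circle.

```python
import numpy as np

def ar_is_stable(phi):
    """Stability check for an AR(p) model: all roots z of
        1 - phi_1 z - ... - phi_p z^p = 0
    must satisfy |z| > 1 (equivalently, the reciprocal-polynomial
    roots lie strictly inside the unit circle)."""
    phi = np.asarray(phi, dtype=float)
    poly = np.concatenate(([1.0], -phi))   # ascending powers: 1, -phi_1, ...
    roots = np.roots(poly[::-1])           # np.roots expects descending powers
    return bool(np.all(np.abs(roots) > 1.0))
```

For an AR(1) model the check reduces to |ϕ_1| < 1: the single root is 1/ϕ_1, so ϕ_1 = 0.9 is stable while ϕ_1 = 1.1 is not.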
Practical Implications for HRV Analysis: In HRV frequency-domain analysis (e.g., LF, HF, LF/HF estimation), the stability of the AR model directly affects the reliability of the estimated spectral density. If roots lie near or outside the unit circle, the PSD may exhibit artificial peaks or exaggerated low-frequency power, compromising physiological interpretation.
When applying AR models to long-term PhysioNet RR interval datasets, stability outcomes depend strongly on window length and preprocessing quality. Fitting an AR model to an entire 24-h recording is generally inappropriate due to pronounced nonstationarity. Therefore, stability verification was performed on short-term windows (5-min segments), consistent with standard HRV methodology.
Dataset-Specific Tendencies: In the Healthy Subjects database, younger individuals typically exhibit rich and dynamic HRV, particularly strong respiratory sinus arrhythmia (RSA). In resting segments, most roots lie within the unit circle, indicating stable models. However, roots often appear close to the boundary (|z| ≈ 0.95–0.99), reflecting strong periodic components. Such near-boundary roots correspond to pronounced spectral peaks, particularly in the high-frequency (HF) band. In the Autonomic Aging Database, age-related reductions in autonomic modulation result in smaller HRV amplitudes. Consequently, estimated AR coefficients tend to be smaller, and roots are often located closer to the center of the unit circle. From a mathematical perspective, these models behave more stably. Physiologically, however, this may reflect reduced signal complexity rather than improved health status.
Overall, verifying that all characteristic roots lie within the unit circle provides a rigorous and principled validation step in AR-based HRV spectral analysis. It ensures that the estimated spectra are mathematically well-defined and that subsequent physiological interpretations are grounded in stable model dynamics.
4. Discussion
The main finding of this study is that AIC-based AR order selection becomes increasingly sensitive to search-range expansion, leading to order inflation, reduced ΔAIC margins, and increased bootstrap instability. Our experimental results suggest two distinct and physiologically meaningful findings regarding the behavior of Akaike’s Information Criterion (AIC)-based autoregressive (AR) model order selection under different autonomic conditions.
First, in healthy subjects, the estimated optimal AR order showed a pronounced dependence on the maximum search order (pmax). When pmax = 20, the mean optimal order was 6.33 with a median of 5.0. In contrast, when the search range was extended to pmax = 50, the mean optimal order increased markedly to 11.11, and in some cases reached the upper bound of 50. Notably, 17.0% of the segments exhibited estimated orders exceeding 20. This tendency suggests that heart rate dynamics in healthy individuals contain rich and complex temporal structures, such that expanding the parameter search space increases the likelihood that higher-order models are favored by likelihood-based criteria. Consequently, in physiologically complex signals, the estimated “optimal” order becomes sensitive to analysis settings, particularly the choice of pmax.
Second, in contrast to healthy subjects, data associated with autonomic aging exhibited markedly greater robustness. When pmax = 20, the mean and median optimal orders were 8.25 and 8.0, respectively. Even when the search range was expanded to pmax = 50, these values changed only marginally (mean 9.07, median 8.0), and the proportion of segments with orders exceeding 20 was limited to 3.8%. This stability indicates that age-related autonomic decline is accompanied by a simplification of heart rate dynamics, such that low-order AR models remain sufficient even when higher-order candidates are permitted.
Taken together, these results support the notion that autonomic aging is associated with a reduction in the effective dimensionality of heart rate dynamics, rendering AR model selection less sensitive to the upper bound of the search range. Conversely, in younger and healthier individuals, expanding pmax increases the risk of capturing increasingly fine-grained—and potentially noise-driven—components, thereby amplifying sensitivity to analytical choices. This contrast highlights an important technical caveat: AR model order estimates in healthy populations may be particularly vulnerable to methodological overfitting when broad search ranges are employed.
In this study, we systematically evaluated the robustness of Akaike’s Information Criterion (AIC) in determining the optimal order (p) of autoregressive (AR) models applied to RR-interval time-series data from the PhysioNet Healthy Subjects Database. Although the AIC is designed to balance model fit and model complexity, our analysis revealed that its performance becomes unstable when the maximum search order is set excessively high (pmax = 50). Under such conditions, the AIC tended to overestimate the optimal AR order, particularly in noisy or mildly nonstationary physiological signals, indicating a risk of overfitting driven by the expansion of the parameter search space. These findings suggest that relying solely on the AIC for order determination can compromise robustness when the search range is overly broad.
To enhance model stability, restricting the maximum allowable AR order or complementing AIC with stricter information criteria—such as the Bayesian Information Criterion (BIC) or the Final Prediction Error (FPE)—is recommended. The risk of inflated order estimation is especially relevant in biological signals such as HRV, where noise, ectopic beats, and short data segments can bias likelihood-based selection. Our results strengthen the view that AR model selection must consider not only statistical optimality but also physiological interpretability.
Previous studies have shown that 9th- to 25th-order AR methods produce statistically similar normalized spectral parameters and have suggested using AR orders of p = 16 or higher for HRV spectral analysis [1,2], and several investigations have discussed optimal order selection in the context of HRV [3,7,8,9,10,11,12,13,14]. However, prior studies have generally relied on small laboratory datasets. In contrast, the present study conducts large-scale comparisons using big-data-level RR series, enabling a more comprehensive assessment of AIC behavior across diverse subjects.
The motivation for developing robust AR-order optimization methods stems from the need to improve both the accuracy of HRV spectral estimation and the reliability of derived indices such as LF, HF, and the LF/HF ratio [15,16,17,18,19,20,21]. Although the spectral distortions we observed were modest in absolute magnitude, the increased variability in LF/HF ratios suggests a potential downstream impact on automated analyses. An AR model predicts the current RR interval from a weighted combination of past values, and the order p determines how many past points are included. Underspecification (small p) risks missing physiologically meaningful frequency components, whereas overspecification (large p) may model noise, producing unstable spectral peaks and inflated variability in HRV indices.
Existing optimization approaches typically focus on (i) information criteria that balance goodness of fit against model complexity, (ii) residual analysis assessing whether residuals approximate white noise, or (iii) FPE-based selection aimed at minimizing prediction error [22,23,24,25,26]. While these approaches provide valuable insights, our results highlight the need for more robust and adaptive strategies.
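A compact sketch (Python/NumPy; the criterion formulas are given in their common textbook forms, which vary slightly across references) shows how the information-criterion family above can be evaluated side by side. Holding the regression sample fixed at the largest candidate order keeps the residual variances directly comparable across p, which also guarantees that the heavier BIC penalty never selects a larger order than the AIC.

```python
import numpy as np

def criteria_orders(x, pmax):
    """AR orders selected by AIC, BIC and FPE, all fitted on the
    common sample x[pmax:] so residual variances are comparable."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    y = x[pmax:]
    N = len(y)
    lags = np.column_stack([x[pmax - k:n - k] for k in range(1, pmax + 1)])
    aic, bic, fpe = [], [], []
    for p in range(1, pmax + 1):
        X = lags[:, :p]
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        s2 = np.mean((y - X @ a) ** 2)
        aic.append(N * np.log(s2) + 2 * p)
        bic.append(N * np.log(s2) + p * np.log(N))  # heavier penalty than AIC
        fpe.append(s2 * (N + p) / (N - p))          # Akaike's FPE
    return tuple(int(np.argmin(c)) + 1 for c in (aic, bic, fpe))

# AR(3) toy series
rng = np.random.default_rng(2)
e = rng.standard_normal(3200)
x = np.zeros(3200)
for t in range(3, 3200):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + 0.2 * x[t - 3] + e[t]
x = x[200:]

p_aic, p_bic, p_fpe = criteria_orders(x, pmax=50)
```

Since the residual variance is nonincreasing in p on a fixed sample, any per-parameter penalty larger than the AIC's constant 2 (BIC's ln N here) can only shift the minimiser toward lower orders, which is the stabilising effect recommended above.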
Future Directions: To achieve more stable and physiologically meaningful HRV estimation, future algorithms should consider hybrid approaches that integrate multiple criteria rather than relying on a single metric. Moreover, adaptive order-selection methods that account for individual differences (e.g., age, autonomic function, noise characteristics, or data length) may provide improved performance [27,28,29,30,31,32,33,34,35,36,37,38]. Machine learning-based meta-criteria or Bayesian hierarchical modeling may also offer promising avenues for individualized AR model selection [39,40].
This study has several methodological and interpretative limitations that should be acknowledged.
Dataset Dependence and Signal Specificity: Although large-scale RR-interval datasets were analyzed, the conclusions are restricted to heart rate variability (HRV) time series derived from ECG recordings. The findings may not generalize to other physiological signals such as respiration, blood pressure variability, or multimodal autonomic indices, which may exhibit different spectral structures and noise characteristics. Moreover, only specific public databases were examined, and population diversity (e.g., pathological cohorts, stress paradigms, or controlled breathing conditions) was limited.
Nonstationarity Handling: HRV signals are inherently nonstationary, particularly in long-duration recordings. In the present study, we did not explicitly segment or correct for nonstationary epochs prior to autoregressive (AR) modeling. Because information criteria such as AIC assume local stationarity, unresolved nonstationarity may have influenced the behavior of the criterion—especially at higher model orders—potentially contributing to apparent order inflation. Future analyses incorporating formal stationarity testing, adaptive segmentation, or time-varying AR approaches would strengthen interpretability.
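As a first step toward the segmentation suggested above, even a crude screen can flag grossly nonstationary recordings before AR fitting. The sketch below (Python/NumPy; the segment count is an illustrative choice, and formal tests such as ADF or KPSS would be preferable in practice) compares segment-wise means and variances across the recording.

```python
import numpy as np

def stationarity_screen(x, n_seg=8):
    """Crude local-stationarity screen: split the series into n_seg
    equal segments and report (mean_drift, var_ratio), where
      mean_drift = range of segment means relative to the global SD,
      var_ratio  = largest / smallest segment variance.
    Large values of either suggest segmenting the series (or using a
    time-varying AR model) before order selection."""
    segs = np.array_split(np.asarray(x, float), n_seg)
    means = np.array([s.mean() for s in segs])
    var_s = np.array([s.var() for s in segs])
    mean_drift = (means.max() - means.min()) / (np.std(x) + 1e-12)
    var_ratio = var_s.max() / (var_s.min() + 1e-12)
    return mean_drift, var_ratio

rng = np.random.default_rng(3)
flat = rng.standard_normal(4000)                 # stationary series
drift = flat + np.linspace(0.0, 5.0, 4000)       # same series plus a slow trend

d_flat, _ = stationarity_screen(flat)
d_trend, _ = stationarity_screen(drift)
```

Recordings flagged by such a screen could be segmented into locally stationary epochs before the order-selection step, limiting the order inflation attributed above to unresolved nonstationarity.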
Single-Criterion Emphasis: The primary empirical analysis focused on the Akaike Information Criterion (AIC). Although alternative criteria such as BIC, FPE, or cross-validation-based approaches were conceptually discussed, direct comparative validation was not performed. Because different criteria impose distinct complexity penalties, the robustness of the present findings across selection frameworks remains to be empirically established.
Lack of Formal Residual Diagnostics: While model order behavior was evaluated in terms of selection stability and spectral outcomes, formal residual diagnostics were not systematically conducted. In time-series modeling—particularly in ARIMA or SARIMA frameworks—post-estimation validation commonly includes tests such as the Ljung–Box test to verify the absence of residual autocorrelation. Such diagnostics help confirm whether the fitted model adequately captures the temporal dependence structure. Incorporating Ljung–Box residual analysis would allow discrimination between physiologically meaningful complexity and noise-driven overfitting, especially at high AR orders. The absence of this step represents a methodological limitation and an important direction for future refinement.
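The Ljung–Box statistic itself is straightforward to compute. A self-contained Python/NumPy sketch follows; the statistic Q = n(n+2) Σ ρ̂²_k/(n−k) and the χ² reference with h − p degrees of freedom for AR(p) residuals are the usual textbook conventions, and the two test series are illustrative.

```python
import numpy as np

def ljung_box_q(resid, h):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} rho_k^2 / (n-k).
    Under the white-noise null, Q is approximately chi-square with
    h - p degrees of freedom for residuals of a fitted AR(p) model."""
    r = np.asarray(resid, float) - np.mean(resid)
    n = len(r)
    denom = r @ r
    rho = np.array([(r[:n - k] @ r[k:]) / denom for k in range(1, h + 1)])
    return n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, h + 1)))

rng = np.random.default_rng(4)
white = rng.standard_normal(1000)   # behaves like well-fitted AR residuals
ar1 = np.zeros(1000)                # strongly autocorrelated 'residuals'
for t in range(1, 1000):
    ar1[t] = 0.9 * ar1[t - 1] + white[t]

q_white = ljung_box_q(white, h=10)
q_ar1 = ljung_box_q(ar1, h=10)
# The chi-square 95th percentile with 10 df is about 18.31; leftover
# autocorrelation drives Q far above that cutoff.
```

Applied at each candidate order, such a check would help distinguish orders that merely absorb noise from those needed to whiten the residuals, which is exactly the discrimination identified above as missing.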
Spectrum-Level Impact Not Exhaustively Quantified: Although order overestimation and variability in the LF, HF, and LF/HF indices were demonstrated, the downstream impact on clinical classification accuracy, risk stratification, and machine learning model performance was not comprehensively evaluated. Even moderate spectral shifts may influence threshold-based decision systems; however, the practical magnitude of this effect remains to be systematically quantified.
Despite these limitations, this work represents one of the first large-scale investigations into the robustness of AIC-driven AR order selection in HRV analysis. By highlighting sensitivity to search-range expansion, boundary-selection effects, and instability indicators, the study provides a methodological foundation for developing more reliable and physiologically interpretable AR modeling frameworks. Future work integrating residual diagnostics, cross-criterion validation, and application-level impact assessment will further advance methodological standardization in HRV research.