1. Introduction
Atmospheric models use historical data to characterize climate dynamics and to understand the interactions among meteorological variables. Among these variables, wind speed is especially relevant because its variations affect others such as temperature, atmospheric pressure, and humidity [
1]. These variables are important for public policy and decision-making, particularly in the health sector, where significant links between extreme weather conditions and increased mortality and morbidity have been documented [
2]. Likewise, climate behavior affects key economic sectors such as agriculture [
3], water management [
4], tourism [
5], and energy production [
6,
7], all of which depend strongly on environmental conditions for planning and operation.
Climate change is a reality that makes it difficult to estimate climate variables, such as the intensity of rainfall and drought periods, as well as the occurrence of extreme events, such as hurricanes, heat waves, and forest fires, which demand automated decision-support tools [
8,
9]. One of the main challenges in climate prediction is modelling the stochastic nature of climate variables, which affects model accuracy and, consequently, decision-making. Likewise, deterministic models struggle to represent the dynamics and temporal variability of wind speed [
10]. Moreover, local topographic conditions introduce additional variations, further complicating modeling.
The temporal variation in climate variables leads to non-stationary, quasi-periodic behavior over extended time intervals [
11]. Nonetheless, these quasi-periodic patterns permit local approximations by assuming stationarity over shorter time intervals [
7,
12,
13]. Local stationarity enables the use of stationary statistical methods to obtain reasonable predictions and detect useful trends [
14]. However, treating wind speed as a complex environmental variable introduces an additional challenge: efficiently estimating its PDF [
15]. In this context, probabilistic models provide a natural alternative because they allow the data to be represented through parametric PDFs and facilitate the identification of stochastic structures in the observations.
The methodological objective of this study is not stochastic modeling in general, but rather the estimation of PDFs at the local scale for each site, accounting for non-stationary, spatially heterogeneous wind conditions. In urban areas, measurements of climatic variables such as wind are influenced by local exposure, building layout, and terrain, so their statistical behavior can vary considerably from one site to another and over time. Under these conditions, a single global density or a fixed mixture of components may fail to capture local tails, secondary modes, or changes in the uncertainty structure. This motivates a framework that first identifies a locally quasi-stationary regime and then estimates the corresponding density through an adaptive parametric structure.
Current approaches for estimating densities in environmental series can be broadly grouped into three categories. First, global parametric models adopt a predetermined functional form and, frequently, a fixed number of components, which may be unduly restrictive under heterogeneous local regimes. Second, non-parametric estimators such as KDE offer flexibility, but their performance depends heavily on bandwidth selection and may oversmooth multimodal or heavy-tailed distributions. Third, methods based on local stationarity provide a principled way to isolate intervals with approximately stable statistics, but, by themselves, they do not define how the resulting densities should be estimated adaptively. The gap addressed in this work lies in combining local regime identification with an adaptive parametric density estimator that does not require the number of components to be fixed in advance.
The automatic construction of stochastic models involves designing probabilistic models whose structure and parameters are inferred from data. This approach minimizes human intervention and avoids imposing rigid global assumptions. Whenever new observations are added, the model is updated iteratively, allowing it to adapt to the local patterns, temporal variability, and changes in the dynamics of the underlying process [
16,
17]. In urban climate applications, local wind behavior may vary substantially from one site to another due to topography, land cover, and exposure conditions, which justifies the use of site-specific stochastic modeling strategies [
18,
19]. In this context, the automatic construction of a stochastic model from historical data should consider three stages: (a) identifying local stationarity; (b) estimating the parameters of the PDF through parametric structures; and (c) validating the quality of the resulting fit.
First, (a) identifying local stationarity aims to find intervals where the statistical properties remain roughly constant. This enables the use of segment-wise stationary models [
12,
14]. To estimate a dominant periodicity, spectral analysis is employed, and the induced subsampling is then statistically validated using the Augmented Dickey–Fuller (ADF) test [
20,
21]. The corresponding fundamentals and formal definitions are presented in
Section 2. Subsequently, (b) the parameters of the PDF are estimated using a parametric mixture model. This approach prioritizes recursive updates, enabling incremental adjustments based on identified segments [
16,
17]. Finally, (c) the fit is validated through discrepancy criteria and residual analysis, and is further supported by comparisons with established reference methods.
The Metropolitan Area of Queretaro (MAQ), Mexico (see
Figure 1), is used as a case study for two main reasons: (i) the need to characterize the local variability of meteorological variables in urban environments for improved monitoring and decision-making, and (ii) the availability of high-frequency and multi-site measurements from a local meteorological network. This motivation is consistent with international guidelines for integrated urban services and urban climate monitoring, which emphasize the value of high-resolution local observations in cities [
22,
23]. In particular, this study uses the RedCIAQ-UAQ network, which provides minute-by-minute climatological data and constitutes a suitable source for evaluating automated methods based on temporal segmentation and density estimation [
24].
This article presents SICABI (Stochastic Inference of Computationally Adaptive Behavioral Structures), a framework for the automatic construction of stochastic models from raw data acquired at MAQ weather stations. The name comes from the Zapotec language of the Isthmus of Tehuantepec, in which “sicabi” means “like the wind,” symbolizing the dynamic and adaptive nature of the proposed framework. The dataset includes geographic information for each site, together with minute-by-minute measurements collected at eight stations from 1 January 2023 to 31 December 2023. The climatic records were aggregated to a 10 min resolution, and the model identifies local stationarity periods at each geographic location. The framework then estimates the corresponding density as a mixture of parametric functions fitted with a recursive version of the Expectation-Maximization (EM) algorithm. In this way, the resulting model changes according to the prevailing local regime at each site, which is conditioned by topography and wind dynamics.
In light of the above, the main contributions of this work are as follows:
SICABI, a two-stage framework for stochastic density estimation under locally quasi-stationary regimes in environmental time series.
ISR (Identification of Stationary Regions), a procedure that identifies a dominant local period through spectral analysis and validates the induced subsampling using the ADF test.
RAPID (Recursive Adaptive Parametric Density Estimation), an adaptive density estimator that recursively updates mixture parameters and creates new components when the local membership criterion is not satisfied.
An empirical evaluation against KDE and MoG under a common protocol, illustrating the trade-off between adaptive complexity, fitting accuracy, and computational cost in an urban climate case study.
3. Materials and Methods
This section describes the operational methodology of the SICABI framework and its computational implementation. The identification of local stationarity follows Section 2.1, whereas the formulation of the parametric mixture and the adaptive scheme follow Section 2.2 and Section 2.3, respectively. The data source, the general processing flow, the implemented algorithms, and their computational complexity are detailed below.
To facilitate the reading of the methodological sequence,
Figure 2 summarizes the operational workflow of the proposed framework. The process begins with the acquisition and regularization of meteorological records, continues with the identification of locally quasi-stationary regimes through ISR, and concludes with adaptive density estimation through RAPID and its comparison against reference methods.
3.1. Study Area and Data Description
The Metropolitan Area of Queretaro (MAQ), Mexico, was selected as the study area due to its urban and peri-urban heterogeneity, which affects local atmospheric behavior. The MAQ is located in the Queretaro Valley and its geographical transition zones, including the Buenavista Valley to the north and the Amazcala Valley to the east. These differences in topography, exposure, and local relief contribute to spatial variability in wind dynamics, even among relatively nearby sites. Such heterogeneity is especially relevant in urban environmental monitoring, where wind influences ventilation, gas transport, and pollutant dispersion [
32]. Under these conditions, location-specific probabilistic schemes are more appropriate than uniform global assumptions [
22,
23].
Historical records from automatic weather stations in the RedCIAQ–UAQ network were used to construct the stochastic model. The network provides real-time meteorological observations at minute-level resolution [
24,
33]. In this study, wind speed was selected as the target variable. For spatial contextualization, two complementary sources of information were considered: (i) wind-speed time series by site, and (ii) static geographic metadata for each station, including latitude, longitude, and elevation above mean sea level.
Figure 1 presents the geographical location of the study area and the spatial distribution of the eight monitored stations.
Table 1 summarizes the geographic attributes of each site together with descriptive statistics of wind speed. This organization allows the spatial context, the monitored locations, and the observed variability to be interpreted jointly.
Table 1 reveals marked spatial heterogeneity in wind behavior across the monitored sites. Aeropuerto exhibits the highest average wind speed (4.323 m/s), suggesting a more persistent flow regime, whereas Amazcala shows the lowest average value (0.438 m/s), consistent with weaker local circulation. La Griega records the highest maximum wind speed (22.0 m/s), indicating the occurrence of stronger extreme events, while variability also differs substantially among sites, with standard deviations ranging from 1.307 to 2.388 m/s. These differences support the use of site-specific stochastic modeling strategies instead of a single homogeneous probabilistic representation for the entire MAQ.
The apparent temporal distinction between the geographic descriptors and the meteorological statistics is due to their different roles in the analysis. Latitude, longitude, and elevation are fixed attributes of each station, whereas the wind-speed statistics correspond to the observation period considered in this study, namely from 1 January 2023 to 31 December 2023. This distinction was made explicit to avoid confusion between static spatial metadata and time-dependent meteorological measurements.
Operationally, the records were organized into standardized time series, one per site. Before modeling, basic quality-control procedures were applied, including timestamp synchronization, duplicate removal, and treatment of missing values by linear interpolation. To homogenize the analysis and reduce high-frequency variability, the minute-level observations were aggregated to a 10 min resolution by taking the mean over non-overlapping windows. This produced a regular sequence for each site s, which was used as input to the ISR stage shown in Figure 3.
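The regularization step described above can be sketched as follows. This is a minimal illustration, assuming a gap-free minute index in which missing readings are marked as NaN; the function names `interpolate_missing` and `aggregate_mean` and the toy values are hypothetical and are not taken from the SICABI implementation.

```python
import numpy as np

def interpolate_missing(x):
    """Fill NaN readings by linear interpolation between valid neighbors.

    Gaps at either end are filled with the nearest valid value,
    which is how np.interp treats points outside the known range.
    """
    x = np.asarray(x, dtype=float).copy()
    idx = np.arange(x.size)
    valid = ~np.isnan(x)
    x[~valid] = np.interp(idx[~valid], idx[valid], x[valid])
    return x

def aggregate_mean(x, window=10):
    """Average over non-overlapping windows; a trailing partial window is dropped."""
    x = np.asarray(x, dtype=float)
    n_windows = x.size // window
    return x[: n_windows * window].reshape(n_windows, window).mean(axis=1)

# Toy minute-level wind-speed record (m/s) with one missing reading.
minute_series = np.array([1.0, 2.0, np.nan, 4.0] + [5.0] * 16)
clean = interpolate_missing(minute_series)
ten_min = aggregate_mean(clean, window=10)  # two 10 min averages
```

Dropping the trailing partial window is one possible convention; averaging whatever remains would be an equally defensible choice.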
3.2. General Scheme of the SICABI Framework
Figure 3 summarizes the operational flow of the SICABI framework. Conceptually, the assessment of local stationarity follows Section 2.1, while the formulation of the parametric mixture and the adaptive scheme follow Section 2.2 and Section 2.3, respectively. The workflow first uncovers locally stationary behavior within a non-stationary record and then builds a stochastic model that describes each local regime through parametric densities.
The remainder of this section describes the specific parameterization and computational implementation that materialize these foundations.
3.3. Algorithm Implementation
The SICABI framework implementation consists of two main steps: ISR and RAPID (
Figure 3). ISR performs local stationarity identification using spectral analysis and inferential validation (see
Section 2.1). Subsequently, RAPID constructs a density estimate based on a mixture of parametric functions with recursive updating and dynamic growth of the number of components (see
Section 2.2 and
Section 2.3). This section details the parameterization decisions, the algorithmic structure, and the computational implementation aspects. The implemented versions of ISR and RAPID are presented below (see Algorithms 1 and 2, respectively).
Algorithm 1 Identification of Stationary Regions (ISR)
Require: $x$ ▹ Historical data of length $N$.
Ensure: $\tau$ ▹ Estimated dominant stationarity period (or $\tau = 1$).
1: procedure ISR($x$)
2:   $X \leftarrow \mathrm{FFT}(x)$ ▹ Fast Fourier Transform
3:   $\tau \leftarrow$ dominant period of $|X|^2$ ▹ Dominant period from spectral power
4:   $x_{\tau} \leftarrow$ subsample of $x$ every $\tau$ units ▹ Induced subsampling
5:   $p \leftarrow \mathrm{ADF}(x_{\tau})$
6:   if $p < \alpha$ then
7:     return $\tau$
8:   else
9:     return $1$ ▹ Fallback policy described in Section 3.4
10:   end if
11: end procedure
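Algorithm 2 Recursive Adaptive Parametric Density Estimation (RAPID)
Require: $x_{\tau}$ ▹ Subseries sampled every $\tau$ units.
Require: $\delta$ ▹ Belonging threshold.
Ensure: $\Theta$ ▹ Mixture parameters $\{(w_k, \mu_k, \sigma_k^2)\}_{k=1}^{K}$.
1: procedure RAPID($x_{\tau}$, $\delta$)
2:   Initialize $\Theta$ with the first observation ▹ First component
3:   Set the learning factor $\eta$ ▹ Learning factor
4:   for all $x_t \in x_{\tau}$ do
5:     $(k^{*}, d^{*}) \leftarrow$ nearest component of $\Theta$ to $x_t$ ▹ Nearest component and distance
6:     if $d^{*} \le \delta$ then
7:       update component $k^{*}$ of $\Theta$ with $x_t$ and $\eta$ ▹ Update
8:     else
9:       add to $\Theta$ a component initialized at $x_t$ ▹ New component
10:     end if
11:   end for
12:   return $\Theta$
13: end procedure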
Once the dominant periodicity is identified and the stationarity of the induced subsampling is validated by ISR, RAPID fits an adaptive parametric mixture. The implementation employs three operators, one for initialization, one for assignment (the membership criterion), and one for the recursive parameter update, in accordance with the theoretical formulation of Section 2.3.
Algorithm 1 assumes complete observations that carry the noise inherent in the acquisition and/or transmission process. To estimate stationarity, the spectral operator computes the spectral power over the positive frequencies and determines the dominant period $\tau$ according to Equation (8):
$$P_k = |X_k|^2, \qquad k = 1, \dots, \lfloor N/2 \rfloor, \tag{8}$$
where $P_k$ is the spectral power and $X_k$ are the Fourier coefficients defined in Equation (2). The operator calculates the dominant period excluding the DC component ($k = 0$) and limiting the search to positive frequencies:
$$k^{*} = \arg\max_{1 \le k \le \lfloor N/2 \rfloor} P_k, \qquad \tau = \frac{N}{k^{*}}. \tag{9}$$
In Equation (9), $k^{*}$ corresponds to the index of the frequency with the highest energy (other than DC), and $\tau$ represents the dominant periodicity in number of samples. This estimate is treated as a subsampling candidate and subsequently validated inferentially using the ADF test in ISR (Section 2.1).
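The dominant-period search described above can be illustrated with a short sketch. It assumes a real-valued, regularly sampled series and uses NumPy's real FFT; the function name `dominant_period` and the synthetic signal are hypothetical, and the period is expressed in samples as the series length divided by the index of the strongest non-DC frequency.

```python
import numpy as np

def dominant_period(x):
    """Return (k_star, period) for the strongest non-DC spectral component."""
    x = np.asarray(x, dtype=float)
    power = np.abs(np.fft.rfft(x)) ** 2      # spectral power, non-negative frequencies
    k_star = 1 + int(np.argmax(power[1:]))   # skip index 0 (the DC component)
    return k_star, x.size / k_star           # period in number of samples

# Synthetic check: offset sinusoid with a period of 25 samples.
t = np.arange(200)
signal = 3.0 + np.sin(2 * np.pi * t / 25)
k_star, period = dominant_period(signal)
```

The constant offset only affects the DC bin, which is why excluding index 0 matters: without it, the mean wind speed would dominate the spectrum.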
Regarding RAPID, the initialization operator creates a component whose mean is the observation that triggers it and whose variance is a preset initial value. The assignment operator decides whether the current observation belongs to an existing component by comparing its distance with the belonging threshold, and returns the nearest component together with the minimum distance. Finally, the update operator recursively adjusts the parameters of the selected component according to the learning rate (Section 2.3).
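The assign-or-create loop of Algorithm 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: components are taken to be one-dimensional Gaussians, the distance is the absolute deviation scaled by the component standard deviation, the update is an exponential-forgetting rule, and weights are relative assignment counts. The names `rapid_sketch`, `threshold`, `eta`, `init_var`, and `var_floor`, as well as their default values, are hypothetical.

```python
import math

def rapid_sketch(samples, threshold=3.0, eta=0.05, init_var=1.0, var_floor=1e-6):
    """Sequentially build a 1-D mixture; returns a list of (weight, mean, variance)."""
    # First component is centered on the first observation.
    comps = [{"mean": float(samples[0]), "var": init_var, "count": 1}]
    for x in samples[1:]:
        # Assumed membership distance: deviation in units of standard deviation.
        dists = [abs(x - c["mean"]) / math.sqrt(c["var"]) for c in comps]
        j = min(range(len(comps)), key=dists.__getitem__)
        if dists[j] <= threshold:
            # Recursive update of the nearest component (assumed forgetting rule).
            c = comps[j]
            delta = x - c["mean"]
            c["mean"] += eta * delta
            c["var"] = max(var_floor, (1.0 - eta) * c["var"] + eta * delta * delta)
            c["count"] += 1
        else:
            # Membership criterion not satisfied: create a new component.
            comps.append({"mean": float(x), "var": init_var, "count": 1})
    total = sum(c["count"] for c in comps)
    return [(c["count"] / total, c["mean"], c["var"]) for c in comps]

# Two well-separated clusters should produce two components.
mixture = rapid_sketch([0.0, 0.1, -0.1, 10.0, 10.1, 9.9, 0.05])
```

Because each observation is compared only with the components that exist at that moment, the result depends on the order of the data, which is the sequential behavior that distinguishes this scheme from batch EM fitting.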
3.4. Configuration and Hyperparameters
Table 2 summarizes the hyperparameters set for ISR and RAPID, as well as those of the comparison protocol (KDE/MoG). To ensure reproducibility and comparability, these values were kept unchanged at all sites; in contrast, site-dependent outputs (e.g., the dominant period and the number of components) are reported in Section 4.3.
If the ADF test applied to the subsampled series does not reject the null hypothesis of a unit root (i.e., the p-value is not below the significance level), alternative spectral peaks are not explored. Instead, the period is set to one sample (no subsampling) and RAPID proceeds on the full aggregated series; this case is recorded for analysis in the results section.
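The validation and fallback policy can be sketched as below. The stationarity test is passed in as a callable so that the example stays self-contained; the name `isr_sketch`, the argument `adf_pvalue`, the rounding of the period to an integer stride, and the default significance level of 0.05 are assumptions made for illustration, and the values actually used are those of Table 2.

```python
import numpy as np

def isr_sketch(x, adf_pvalue, alpha=0.05):
    """Return (stride, validated): the dominant period if the test accepts it, else 1."""
    x = np.asarray(x, dtype=float)
    power = np.abs(np.fft.rfft(x)) ** 2
    k_star = 1 + int(np.argmax(power[1:]))       # strongest non-DC frequency
    tau = max(1, int(round(x.size / k_star)))    # candidate period in samples
    if adf_pvalue(x[::tau]) < alpha:             # unit-root null rejected
        return tau, True
    return 1, False                              # fallback: no subsampling

t = np.arange(200)
series = np.sin(2 * np.pi * t / 25)
# Stub p-values stand in for a real ADF test in this toy run.
accepted = isr_sketch(series, adf_pvalue=lambda sub: 0.01)
fallback = isr_sketch(series, adf_pvalue=lambda sub: 0.50)
```

In practice the callable could wrap an existing ADF implementation, for example by returning the second element of the tuple produced by `statsmodels.tsa.stattools.adfuller`.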
3.5. Implementation Issues
This subsection discusses the computational complexity of the implemented algorithms, organized in three main stages:
Spectral segmentation using FFT;
Statistical validation of stationarity and periodicity selection in ISR;
Density estimation with adaptive growth in RAPID.
Hereafter, $N$ denotes the length of the aggregated series per site. Let $T(N)$ be the time required to compute the FFT of a series of size $N$. The recurrence of the Cooley-Tukey scheme is $T(N) = 2T(N/2) + O(N)$, which solves to $T(N) = O(N \log N)$. This complexity improves upon direct DFT calculations, which have a cost of $O(N^2)$.
ISR applies the FFT with a complexity of $O(N \log N)$. Estimating the dominant period $\tau$, which inspects the spectral magnitudes, requires $O(N)$. Subsampling every $\tau$ units consumes $O(N/\tau)$, and in the worst case (when $\tau = 1$) it is $O(N)$. The ADF test, for a fixed number of lags, is typically considered to be linear in the length of the tested subseries, that is, $O(N/\tau)$. Consequently, ISR is FFT-dominated:
$$T_{\mathrm{ISR}}(N) = O(N \log N).$$
Let $x_{\tau}$ be the subseries (per site $s$) obtained by the subsampling induced by $\tau$, and let $M$ be its length. Under regular subsampling conditions, $M \approx N/\tau$, where $N$ denotes the length of the aggregated series.
The initialization of the first component has constant cost, $O(1)$. The main loop processes $M$ observations. For each observation, evaluating the membership criterion requires a pass over the components that exist at that step. If the number of components grows linearly with $t$, this evaluation costs $O(t)$ at step $t$, and the total over the loop is $O(M^2)$. In the typical situation where the number of components stabilizes at a bounded value $K$, the computational cost is $O(MK)$.
Therefore, the worst-case computational complexity of RAPID is
$$T_{\mathrm{RAPID}}(M) = O(M^2).$$
In the usual case where $K$ stabilizes and remains bounded, the cost $O(MK)$ is nearly linear in $N$, since $M \le N$.
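In summary, ISR has dominant complexity $O(N \log N)$. Let $M$ be the length of the subseries, with $M \approx N/\tau$ under regular subsampling. RAPID has a worst-case cost of $O(M^2)$ and typically $O(MK)$ when the number of components stabilizes at $K$.
The global cost is expressed as
$$T_{\mathrm{SICABI}}(N) = O(N \log N + M^2)$$
and, in the typical regime with a bounded value of $K$, the final complexity is
$$T_{\mathrm{SICABI}}(N) = O(N \log N + MK),$$
which reduces to $O(N \log N)$ because $M \le N$.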
Although the worst theoretical case for RAPID is $O(M^2)$ when the number of components grows without bound, in the configuration used (see Table 2) the number of components stabilizes at a small value $K$ (see Table 3). In this typical regime, the cost of the density estimation phase in RAPID is approximated by $O(MK)$, and when $K$ is bounded, as in our experiments, the complexity is effectively linear in the input size:
$$T_{\mathrm{RAPID}}(M) = O(M).$$
This point is relevant for comparison, since ISR is applied as common preprocessing to construct the subseries, and both KDE and MoG are fitted on the same subseries (Section 4.2). Consequently, the cost disparities observed in the experiments primarily reflect the complexity of the density phase of each estimator. In this phase, RAPID operates incrementally with an effective linear cost in $M$ when $K$ is bounded.
5. Discussion
A central result of this study is the marked spatial variability in the dominant wind regime across the MAQ sites, as shown in
Table 4. This finding is consistent with the literature on environmental time series, where long records often fail to satisfy global stationarity assumptions but may still admit locally stationary approximations over shorter intervals [
11,
12,
14]. In the present case, the differences observed among stations indicate that wind dynamics in the MAQ should not be represented through a single homogeneous probabilistic structure, since local exposure, relief, and urban conditions affect both temporal behavior and distributional shape.
From this perspective, ISR is not only a preprocessing stage but also an interpretative component of the framework. By identifying a dominant period for each site and validating the induced subsampling through the ADF test, ISR establishes a connection between temporal structure and subsequent density estimation. This is relevant because, when temporal heterogeneity is ignored, a global fit may combine observations from distinct local regimes and distort the resulting PDF. Therefore, the use of the ISR-induced subseries provides a more coherent basis for stochastic modeling and is aligned with the local-stationarity rationale reported in the literature [12,14].
The residual analysis supports this interpretation. Although several stations exhibit residuals centered near zero, others show asymmetry and heavy tails, as in the case of Amazcala (Table 5), suggesting that local wind behavior is not only heterogeneous in time but also structurally complex in distributional terms. Under such conditions, RAPID offers a practical advantage because it does not require the number of mixture components to be fixed in advance. Instead, the model adapts its structure through recursive updating and dynamic component creation, yielding a site-dependent number of components (Table 3). This behavior is consistent with the broader literature on adaptive and mixture-based modeling, where recursive updating and the ability to represent multimodality are especially valuable in complex stochastic environments [16,17,30].
Under the homogeneous comparison protocol defined in
Section 4.2, RAPID achieved the lowest RMSE at 6 out of 8 sites (
Table 7) and the lowest MISE at 5 out of 8 sites (
Table 8). These results suggest that the main strength of RAPID is not the optimization of a single metric in isolation, but rather its capacity to maintain a favorable balance among local adaptability, structural interpretability, and fit stability. In contrast, KDE depends strongly on bandwidth selection, whereas MoG requires model selection and iterative EM fitting over a predefined range of components. Within this comparison, RAPID preserves the interpretability of a parametric mixture while avoiding the need to predefine a fixed value of
K.
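To make the two discrepancy measures concrete, the sketch below compares a candidate density with a histogram-based reference on a regular grid. It is an illustration only: the function name `discrepancy`, the use of bin centers as the grid, and the Riemann-sum approximation of the integrated squared error (used here as a single-sample stand-in for MISE) are assumptions, and the exact definitions are those of the comparison protocol in Section 4.2.

```python
import numpy as np

def discrepancy(samples, pdf, n_grid=50):
    """Return (rmse, ise) between pdf and a histogram reference on n_grid points."""
    samples = np.asarray(samples, dtype=float)
    edges = np.linspace(samples.min(), samples.max(), n_grid + 1)
    reference, _ = np.histogram(samples, bins=edges, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])       # regular evaluation grid
    err = pdf(centers) - reference
    dx = edges[1] - edges[0]
    rmse = float(np.sqrt(np.mean(err ** 2)))       # pointwise error on the grid
    ise = float(np.sum(err ** 2) * dx)             # integrated squared error
    return rmse, ise

# Evenly spread samples on [0, 1] compared with the uniform density.
uniform_samples = np.linspace(0.0, 1.0, 1001)
rmse, ise = discrepancy(uniform_samples, pdf=lambda g: np.ones_like(g))
```

On a regular grid the two quantities differ only by a scale factor tied to the support width, so disagreements between RMSE and MISE rankings in practice come from how each protocol weights the tails and the grid, not from the squared error itself.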
The sites at which MoG attained lower MISE values, particularly Amazcala, La Griega, and Milenio III, are also informative. These sites appear to involve more demanding distributional configurations, where the integrated error is more sensitive to tail behavior, multimodality, or the allocation of mixture components. Rather than contradicting the proposed method, these results help delimit the conditions under which further refinement of RAPID may be beneficial. In particular, they suggest that an automatic adjustment of the RAPID hyperparameters and a more systematic optimization of the initialization parameters could improve integrated fidelity without sacrificing the adaptive behavior already observed. These cases also suggest that the sequential nature of RAPID may entail a trade-off between local adaptability and global integration fidelity.
This pattern also makes explicit what may be termed the cost of recursivity in RAPID. Because the estimator is built sequentially, each update is decided from the current component configuration and the incoming observation, which favors adaptive growth and low computational cost. However, once new components are created, RAPID does not perform a full joint re-optimization of all means, variances, and weights over the complete support. As a result, the method may lose global integration fidelity in comparison with batch-optimized mixtures, particularly in tails or low-density regions between components. In some demanding sites, such as Milenio III, this limitation may also affect pointwise fit, as reflected by the fact that RAPID is outperformed by KDE in both RMSE and MISE. By contrast, the batch EM–BIC strategy used in MoG optimizes the mixture globally and can therefore redistribute component structure in a way that better reduces integrated error across the support. In this sense, the lower MISE values attained by MoG at some sites, together with the Milenio III behavior relative to KDE, should be interpreted as evidence of the trade-off between recursive local adaptability and global mixture optimization.
Another relevant advantage of RAPID is its computational profile. As shown in Table 9, RAPID was consistently more efficient than KDE and MoG under the same experimental conditions. This difference is methodologically meaningful because the comparison used the same input subseries, the same evaluation grid, and the same discrepancy metrics. The lower cost of RAPID follows from its sequential assignment-and-update mechanism, whereas KDE requires global smoothing and MoG involves repeated EM fitting together with BIC-based model selection. This feature makes RAPID especially attractive for operational or continuously updated settings, where recalibration must be carried out at low computational cost.
To summarize these methodological differences under the common evaluation protocol,
Table 11 compares RAPID with KDE and MoG in terms of estimator type, structural adaptability, interpretability, sensitivity to hyperparameters, and computational profile.
Overall, the discussion supports two main conclusions. First, the MAQ case study confirms that local stochastic structure matters when modeling urban wind behavior. Second, RAPID provides a competitive alternative to standard reference methods by combining adaptive structure, recursive estimation, interpretability, and reduced computational cost. These properties position the method as a useful site-specific modeling strategy for non-stationary environmental signals.
Symmetry Perspective
Within this work, the notion of symmetry is restricted to an approximate local invariance under time translation. Let $X_s(t)$ denote the wind-speed process at site $s$. For a local regime $\mathcal{R}$ identified by ISR, this invariance is understood in the weak stochastic sense: for admissible shifts $\Delta$ within $\mathcal{R}$,
$$\mathbb{E}[X_s(t+\Delta)] \approx \mathbb{E}[X_s(t)]$$
and
$$\operatorname{Cov}\big(X_s(t+\Delta), X_s(t+\Delta+h)\big) \approx \operatorname{Cov}\big(X_s(t), X_s(t+h)\big)$$
for relevant lags $h$ inside the same regime. Under this interpretation, the symmetry considered here is not geometric or flow-based, but temporal and statistical, and corresponds to the local quasi-stationarity assumption used by SICABI.
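The weak invariance described above can be probed numerically by comparing first- and second-order statistics of two time-shifted windows. The sketch below is only a diagnostic illustration and is not part of ISR; the names `autocov` and `invariance_gap`, the window sizes, and the synthetic series are hypothetical.

```python
import numpy as np

def autocov(x, h):
    """Sample autocovariance of x at lag h (biased toward the window mean)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    return float(np.mean(xc[: x.size - h] * xc[h:]))

def invariance_gap(x, window, shift, lags=(0, 1, 2)):
    """Largest mean and autocovariance differences between two shifted windows."""
    x = np.asarray(x, dtype=float)
    a, b = x[:window], x[shift : shift + window]
    mean_gap = abs(a.mean() - b.mean())
    cov_gap = max(abs(autocov(a, h) - autocov(b, h)) for h in lags)
    return float(mean_gap), float(cov_gap)

rng = np.random.default_rng(0)
stationary = rng.normal(size=4000)                      # invariant by construction
drifting = stationary + np.linspace(0.0, 5.0, 4000)     # mean changes with time
gap_stationary = invariance_gap(stationary, window=1500, shift=2000)
gap_drifting = invariance_gap(drifting, window=1500, shift=2000)
```

Small gaps are compatible with a locally invariant regime, whereas a large mean gap, as in the drifting series, signals that the translation invariance does not hold at that time scale.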
From this perspective, ISR searches for a time scale at which the process can be represented through approximately invariant first- and second-order statistics. Thus, the dominant period acts as a descriptor of the local regime and establishes the temporal scale at which a weak form of translational invariance is considered plausible.
Departures from this approximate local invariance are reflected in residual asymmetry, heavy tails, multimodality, and cross-site heterogeneity. In this sense, RAPID does not model symmetry breaking in a formal group-theoretic way; rather, it provides an adaptive mechanism to represent increasing distributional complexity once a simple locally invariant description is no longer sufficient. Under a fixed estimation protocol, the number of components $K_s$ can therefore be interpreted as a relative complexity indicator: larger values of $K_s$ imply that more adaptive structure is required to represent the local density at site $s$. A simple surrogate measure of this departure is
These quantities should be understood as operational indicators of departure from a simple locally invariant regime, not as formal order parameters derived from an explicit symmetry group.
6. Conclusions
This work addressed the estimation of wind-speed PDFs in urban environments under non-stationary and spatially heterogeneous conditions. To this end, SICABI was used as a two-stage framework: ISR identified a dominant local regime through spectral analysis and ADF-based validation, and RAPID estimated the corresponding density through an adaptive parametric mixture. The analysis of eight sites in the Metropolitan Area of Queretaro showed that the dominant period is markedly site-dependent, with values ranging from 37 to 166 days, which confirms that the temporal structure of the signal cannot be represented adequately by a single global regime.
The quantitative results support the proposed approach. RAPID adapted its structural complexity across sites, selecting a site-dependent number of components (Table 3) without requiring that number to be fixed in advance. Under the common comparison protocol, RAPID achieved the lowest RMSE at six of the eight monitored sites and the lowest MISE at five of the eight sites. In addition, RAPID showed a lower computational cost than KDE and MoG under the same input data and evaluation conditions. These results indicate that the method provides a favorable balance between local adaptability, parametric interpretability, and computational efficiency.
From a modeling perspective, the results show that density estimation based on local stochastic structures constitutes a coherent alternative for non-stationary environmental signals. In contrast to global representations, RAPID adjusts the density according to the regime detected at each site, which makes it better suited to heterogeneous urban wind behavior. Although MoG attained lower MISE values at some sites, the overall results suggest that RAPID is competitive as a site-specific estimator and particularly attractive when adaptive structure and low computational cost are required.
Overall, the MAQ case study supports two main conclusions: first, local stochastic structure matters when modeling urban wind behavior; and second, RAPID provides a practical and interpretable alternative to standard reference methods for density estimation in non-stationary climate signals.
Limitations and Future Work
The present study has several limitations that delimit the current scope of the proposed framework. Firstly, the analysis was conducted in a univariate setting, so the model does not yet represent cross-variable or cross-site dependencies. Secondly, the performance of SICABI depends on the correct identification of the dominant local regime through ISR, which may become less reliable when the signal contains multiple regime transitions, several competing spectral peaks, or more irregular temporal structures. Thirdly, the empirical reference density was approximated through a histogram evaluated on a 50-point regular grid, which may affect the absolute magnitude of RMSE and MISE depending on the discretization used. Finally, the case study was restricted to eight sites within a specific observation period, which limits the direct generalization of the results to other regions, periods, or meteorological variables.
These limitations suggest several immediate research directions. A first extension is to generalize RAPID toward a multivariate and potentially spatial formulation capable of representing dependencies among variables and stations. A second direction is to incorporate automatic hyperparameter adjustment, particularly for the RAPID hyperparameters, together with more robust strategies for constructing the empirical reference density and the evaluation grid. A third line is to extend ISR so that it can accommodate multiple dominant periodicities, either by selecting several relevant spectral peaks or by combining the current procedure with sliding-window analyses. This extension is especially relevant for those sites where the signal may not be adequately summarized by a single dominant regime. Finally, an applied extension of the framework is its integration into a real-time monitoring environment, where stability, latency, missing-data tolerance, and robustness to acquisition noise can be evaluated under operational conditions.