A Hybrid Algorithm Combining Wavelet Analysis and Deep Learning for Predicting Agroclimatic Pest Infestations

Akanova, Akerke; Ospanova, Nazira; Muratova, Gulzhan; Sharipova, Saltanat; Tokzhigitova, Nurgul; Anarbekova, Galiya

doi:10.3390/a19030242

by Akerke Akanova ¹, Nazira Ospanova ²,*, Gulzhan Muratova ¹,*, Saltanat Sharipova ³, Nurgul Tokzhigitova ² and Galiya Anarbekova ¹

¹ Computer Science, Business and Digital Technologies, Kazakh Agrotechnical Research University Named After S. Seifullin, Astana 010000, Kazakhstan

² Department of Information Technology, Faculty of Computer Science, Toraighyrov University, Pavlodar 140000, Kazakhstan

³ School of Software Engineering, Astana IT University, Astana 010000, Kazakhstan

* Authors to whom correspondence should be addressed.

Algorithms 2026, 19(3), 242; https://doi.org/10.3390/a19030242

Submission received: 3 February 2026 / Revised: 19 March 2026 / Accepted: 21 March 2026 / Published: 23 March 2026

Abstract

Forecasting crop pest outbreaks under conditions of increasing agroclimatic variability is a critical task for intelligent decision support systems in agriculture. Traditional statistical and empirical models typically have limited transferability and insufficient accuracy when describing nonlinear and multiscale relationships between climatic factors and pest population dynamics. This paper proposes a hybrid algorithm combining wavelet analysis and deep learning methods for forecasting agroclimatic pest infestation levels. The algorithm is based on multiscale decomposition of time series using a discrete wavelet transform, after which the extracted components are used as input features for a deep neural network implementing a nonlinear mapping between climatic parameters and infestation indicators. The developed computational framework includes the stages of data preprocessing, feature space formation, model training, and forecast generation in a single, reproducible pipeline. An experimental evaluation using long-term agroclimatic and phytosanitary data showed that the proposed algorithm outperforms classical regression and individual neural network models in terms of RMSE, MAE, and the coefficient of determination. The results confirm the effectiveness of integrating wavelet analysis and deep learning for developing phytosanitary risk forecasting algorithms and demonstrate the potential of the proposed approach for implementation in intelligent precision farming systems.

Keywords: hybrid algorithm; wavelet analysis; deep learning; time series forecasting; pest infestation prediction; agroclimatic variability; intelligent decision support; precision agriculture

1. Introduction

Sustainable agricultural development and food security increasingly depend on computational algorithms capable of detecting phytosanitary threats early under growing agroclimatic variability. More frequent extreme weather events, unstable temperature regimes, and changing precipitation patterns significantly raise the likelihood of pest outbreaks. However, traditional forecasting methods based on thresholds and expert schemes have low transferability and limited accuracy, especially when applied to new climatic conditions and heterogeneous agricultural landscapes [1,2,3]. This necessitates algorithmic models that can account for nonlinear dependencies and the multiscale structure of climate signals [4,5].

In recent years, machine learning methods and deep neural networks have become a key technological component of intelligent decision support systems in agriculture. Algorithmically, deep models provide approximation of complex nonlinear functions, robustness to noise and outliers, and the ability to learn from high-dimensional data, making them effective for modeling bioclimatic processes [6,7,8]. However, the task of predicting pest population levels remains methodologically and computationally complex, as population dynamics are determined by the combined influence of early spring weather conditions, extreme summer anomalies, pest biology, and differences in agricultural practices [9]. These factors generate non-stationary and heterogeneous time series that are difficult to describe within a single class of models.

One of the most promising algorithmic approaches is the construction of hybrid architectures that combine statistical time series models with deep learning components. Wavelet transforms enable algorithmic decomposition of signals into trend, seasonal, and high-frequency components [10]. LSTM and GRU recurrent neural networks effectively model temporal dependencies with strong autocorrelation [11], while transformer architectures provide an attention mechanism for identifying long-range dependencies and complex nonlinear relationships in agroclimatic sequences [12,13,14]. Several studies show that the integration of these approaches leads to a significant increase in the accuracy of agrometeorological forecasts [15].

Despite the progress made, most existing studies are limited to individual model families or partial hybrid schemes and rarely offer a unified algorithmic pipeline that combines multiscale signal decomposition, statistical modeling, and deep neural architectures in a reproducible computational framework. This problem is particularly pressing for Kazakhstan, where grain crops play a key role in agricultural production and are characterized by high interannual and intraannual climatic variability. Traditional approaches used by agricultural services based on observations, phenological data, and simple relationships between weather factors and pest numbers are limited by linearity, low feature dimensionality, and poor adaptability to changing climatic conditions.

This paper proposes a new hybrid algorithmic framework based on a multi-branch Wavelet–CNN–SARIMA–GRU–STTransformer architecture. The main contributions of the study are as follows: (i) a multi-branch hybrid algorithm is developed that combines SARIMA residual modeling, multiscale wavelet decomposition, convolutional feature extraction, recurrent learning, and a transformer attention mechanism in a single computational framework; (ii) ten baseline and advanced models, ranging from linear regressions and ensembles to wavelet CNN and recurrent hybrids, are systematically compared within a single reproducible protocol; (iii) the reproducibility of the experiments is ensured by fixed data partitioning, common normalization procedures, and controlled random number generators; (iv) a harmonized panel dataset of agroclimatic and phytosanitary data for Kazakhstan covering 2005–2022 is used; (v) a significant increase in forecasting accuracy is achieved (RMSE = 0.0223; MAE = 0.0178; R² = 0.9466) compared to alternative models; (vi) the interpretability of the algorithm is preserved, and key climatic factors influencing seasonal pest infestations are identified.

The aim of the study is to develop and validate a highly accurate hybrid algorithm for forecasting the interannual dynamics of summer pest infestations of grain crops based on agroclimatic data. To achieve this goal, a consistent set of agroclimatic and phytosanitary indicators was formed, a multiscale wavelet feature representation was created, convolutional, recurrent, and transformer modules were integrated into a single architecture, and a quantitative assessment of forecasting accuracy was conducted. The obtained results demonstrate the high potential of the proposed algorithm for implementation in precision farming systems, digital platforms for pest risk analysis, and intelligent early warning systems.

Section 2 reviews related work on hybrid and deep learning models for agroclimatic forecasting. Section 3 describes the data used, including climate predictors, seasonal aggregates, and target phytosanitary indicators, substantiates their relationship with the biological mechanisms of pest development, and details the research methodology, data preparation procedures, model configurations, and the experimental protocol used for their objective comparison. Section 4 presents the results of the statistical analysis, visualization, feature importance interpretation, and an assessment of the accuracy of all studied models, including the proposed hybrid architecture. Section 5 presents a discussion of the identified patterns, a comparison with modern scientific works, and an analysis of the advantages of the proposed approach. Finally, Section 6 presents the main conclusions of the study and formulates key directions for the further development of methods for predicting phytosanitary risks.

2. Related Works

In recent years, research on the application of hybrid models and deep learning technologies to agroclimatic forecasting and phytosanitary risk modeling has increased significantly. Research over the past three years has demonstrated that combining statistical time series models with neural network architectures can dramatically improve forecast accuracy and robustness to climate variability. For example, the use of hybrid ARIMA and convolutional networks achieves high performance in seasonal agrometeorological forecasting [16], while integrating wavelet decomposition into hybrid neural network architectures enhances models’ ability to extract multiscale components of climate signals [17].

Several studies have shown that recurrent architectures, such as LSTM and GRU, more accurately reconstruct complex temporal dependencies in agroclimatic series, including pest dynamics, soil moisture, and plant stress indices [18]. At the same time, the field of transformer architectures and their spatiotemporal modifications is actively developing, demonstrating excellent ability to capture long-range temporal and spatial patterns in climate data [19,20]. Such models are increasingly used to predict pests, droughts, extreme agrometeorological events, and bioclimatic risks [21].

Particular attention is being paid to hybrid architectures based on wavelet transforms, which effectively separate data into low- and high-frequency components, enabling neural networks to perceive seasonal structure and extreme deviations better. Work in 2022–2024 demonstrates that such models are successfully applied to monitor agroecosystems, assess soil moisture, and analyze bioclimatic time series [22,23]. Furthermore, modern machine learning methods combined with agroclimatic indicators can significantly improve the forecasting of pest outbreaks and yield more robust results than traditional agricultural models [24,25].

Taken together, these studies highlight the robust development of a field focused on integrating statistical, neural network, and multiscale approaches. This confirms the scientific and technological significance of the Wavelet-CNN-SARIMA-GRU-STTransformer hybrid architecture proposed in this paper, which aligns with modern trends in intelligent agricultural analytics and represents a logical extension of existing methods.

However, despite significant progress, the studies presented address specific problems and are typically limited to either a single model type (e.g., recurrent networks, transformers, or wavelet hybrids) or to climate data with limited structure and insufficient temporal coverage. In contrast, the architecture proposed in this paper combines several complementary components: the seasonal SARIMA statistical block, multiscale wavelet decomposition, convolutional layers for extracting local frequency patterns, the GRU recurrent module, and the spatiotemporal STTransformer, allowing us to simultaneously consider the trend, stochastic, high-frequency, and spatiotemporal structures of the data. An additional distinction is the use of a single-panel dataset spanning 18 years, with aggregated seasonal climate features optimized explicitly for forecasting biological indices, along with a rigorous comparison protocol for 10 alternative models. This integration of multilevel time series representations, comprehensive model evaluation in a panel structure, and the combination of several architectural classes provides our work with a significant advantage in forecast accuracy, interpretability, and robustness compared to existing approaches.

3. Materials and Methods

3.1. Data Description and Feature Construction

The study used a harmonized panel dataset (https://drive.google.com/file/d/1_TruOn8lv0X212fxnAvYHLQjihAYsXQA/view?usp=sharing) (accessed on 9 October 2025) of long-term field observations on infestations of grain crops by the pest Phyllotreta vittula (the striped flea beetle) in Kazakhstan. To construct the input data for the hybrid temporal model, the original panel dataset (at the annual and spatial unit levels) was transformed into a single annual time series. For each year, the pest infestation index was aggregated across spatial units using an area-weighted average based on the surveyed crop area, ensuring that regions with larger agricultural areas contributed proportionally to the annual index. Climate predictors were similarly aggregated by calculating spatially weighted seasonal means (April–August) for temperature and precipitation. This aggregation procedure preserves the interannual biological signal while smoothing local spatial noise, thereby creating a consistent annual sequence suitable for wavelet decomposition and subsequent temporal modeling. The resulting time series (2005–2022) serves as the main input for all hybrid architectures. The basic unit of observation is a pair $(y, i)$, where $y$ is the calendar year and $i$ is the identifier of the surveyed district or agricultural unit. For each $(y, i)$, the working dataset contains:

Year of observation;
Spatial identifier ID linked to the regional directory;
Categorical attributes of region (RegionName) and area (OblastID), added based on the ID;
Agroclimatic indicators for the months April–August;
Summer pest infestation and abundance indicators;
Area of surveyed crops in summer.

Thus, the data have a year-area panel structure, allowing simultaneous analysis of the temporal dynamics and spatial heterogeneity of damage. The primary goal of the modeling is to predict the level of summer pest infestation. The sole target variable is the summer infestation index zasel_l, defined for each pair $(y, i)$ as (1):

$$Y_{y,i} = \mathrm{zasel}_{l}(y, i) \tag{1}$$

where $Y_{y,i}$ is interpreted as the integral index of crop infestation by the pest during the growing season (summer). This indicator directly reflects the actual phytosanitary load during the critical period for crop formation, and it is precisely this index that all constructed models predict. Additionally, the source data include the variable kol_l, which describes summer pest abundance (a quantitative indicator per unit area). In the study, kol_l was used in descriptive and correlation analyses (to assess the relationship between infestation level and pest abundance), but was not the primary target in forecasting tasks.
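The area-weighted annual aggregation described at the start of this subsection can be illustrated with a minimal sketch. The record layout `(year, index value, surveyed area)` and the toy numbers are assumptions made for illustration only; they are not taken from the dataset or from the authors' code.

```python
from collections import defaultdict


def area_weighted_annual_index(rows):
    """Aggregate district-level rows into one area-weighted value per year.

    Each row is a (year, infestation_index, surveyed_area) tuple; this field
    order is an assumption of the sketch, not the dataset's actual schema.
    """
    weighted_sum = defaultdict(float)
    total_area = defaultdict(float)
    for year, index_value, area in rows:
        weighted_sum[year] += index_value * area
        total_area[year] += area
    # Years with zero surveyed area are skipped to avoid division by zero.
    return {
        year: weighted_sum[year] / total_area[year]
        for year in sorted(total_area)
        if total_area[year] > 0
    }


# Toy records (hypothetical values).
rows = [
    (2005, 0.10, 100.0),
    (2005, 0.30, 300.0),
    (2006, 0.20, 50.0),
    (2006, 0.40, 150.0),
]
annual = area_weighted_annual_index(rows)
```

In this toy case the district with the larger surveyed area pulls the 2005 value toward 0.30, which is the proportional contribution the text refers to.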

For each $(y, i)$ in the dataset, monthly mean temperature and precipitation values are available for the months April through August. Temperature variables include t_april, t_may, t_june, t_july, and t_august. Precipitation variables include p_april, p_may, p_june, p_july, and p_august.

Let $T_{m}(y, i)$ denote the average monthly temperature in month $m$, and $P_{m}(y, i)$ denote the corresponding precipitation indicator (or moisture index). Then, for example,

$$T_{\mathrm{Apr}}(y, i) \leftrightarrow \mathrm{t\_april}(y, i), \quad P_{\mathrm{Jul}}(y, i) \leftrightarrow \mathrm{p\_july}(y, i) \tag{2}$$

and so on for all months $m \in \{\mathrm{Apr}, \mathrm{May}, \mathrm{Jun}, \mathrm{Jul}, \mathrm{Aug}\}$.

Initially, the data also included autumn agrometeorological parameters, but they were excluded from the set of predictors for biological and agronomic reasons. The spring months (April and May) are retained in the model, since early spring conditions determine the success of pest overwintering, emergence from diapause, early population development, and the condition of seedlings and young plants, which directly affects the subsequent summer infestation level. The summer months (June, July, August) reflect the climatic situation during the main phase of vegetation, when the actual level of crop damage is formed and the $\mathrm{zasel}_{l}$ indicator is measured. Autumn parameters primarily characterize the pest population's preparation for winter and affect next year's population size rather than the current summer damage, so they are not included in the model. As a result, the climatic predictor block is limited to the period April–August, which ensures the biological consistency of the problem statement and the relevance of the modeled forecast indicator. To take into account the intensity of surveys and the scale of the observed territory, the model includes the agronomic variable audan_l, the area of surveyed crops in summer for each pair $(y, i)$:

$$A_{y,i} = \mathrm{audan}_{l}(y, i) \tag{3}$$

This variable was used as a predictor, as changes in survey area can influence the observed infestation index (due to differences in monitoring intensity and area coverage). In addition to the initial monthly values, a set of aggregated seasonal indicators was created, reflecting the integral conditions of the spring–summer period and its intra-seasonal structure. The following indicators were constructed for temperature:

Average temperature of the early period (April–May):

$$t_{y,i}^{(\mathrm{early})} = \frac{T_{\mathrm{Apr}}(y, i) + T_{\mathrm{May}}(y, i)}{2} \tag{4}$$

Average temperature of the middle of the growing season (June–July):

$$t_{y,i}^{(\mathrm{mid})} = \frac{T_{\mathrm{Jun}}(y, i) + T_{\mathrm{Jul}}(y, i)}{2} \tag{5}$$

Temperature of the late period (August):

$$t_{y,i}^{(\mathrm{late})} = T_{\mathrm{Aug}}(y, i) \tag{6}$$

Average temperature for the entire period April–August:

$$t_{y,i}^{(\mathrm{season})} = \frac{T_{\mathrm{Apr}}(y, i) + T_{\mathrm{May}}(y, i) + T_{\mathrm{Jun}}(y, i) + T_{\mathrm{Jul}}(y, i) + T_{\mathrm{Aug}}(y, i)}{5} \tag{7}$$

Temperature range of the season (contrast of conditions):

$$t_{y,i}^{(\mathrm{range})} = \max_{m \in \{\mathrm{Apr}, \dots, \mathrm{Aug}\}} T_{m}(y, i) - \min_{m \in \{\mathrm{Apr}, \dots, \mathrm{Aug}\}} T_{m}(y, i) \tag{8}$$

Analogous aggregates were constructed for precipitation:

Total precipitation of the early period (April–May):

$$p_{y,i}^{(\mathrm{early})} = P_{\mathrm{Apr}}(y, i) + P_{\mathrm{May}}(y, i) \tag{9}$$

Total precipitation during the middle of the growing season (June–July):

$$p_{y,i}^{(\mathrm{mid})} = P_{\mathrm{Jun}}(y, i) + P_{\mathrm{Jul}}(y, i) \tag{10}$$

Late period precipitation (August):

$$p_{y,i}^{(\mathrm{late})} = P_{\mathrm{Aug}}(y, i) \tag{11}$$

Total precipitation for the entire period April–August:

$$p_{y,i}^{(\mathrm{season})} = \sum_{m \in \{\mathrm{Apr}, \dots, \mathrm{Aug}\}} P_{m}(y, i) \tag{12}$$

Maximum monthly precipitation for the season:

$$p_{y,i}^{(\max)} = \max_{m \in \{\mathrm{Apr}, \dots, \mathrm{Aug}\}} P_{m}(y, i) \tag{13}$$
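As a minimal sketch, Equations (4)–(13) can be computed for a single $(y, i)$ record as shown below. The input keys follow the monthly variable names listed above and the output keys follow the dataset's aggregate feature names; the function name `seasonal_features` and the example values are hypothetical.

```python
MONTHS = ("april", "may", "june", "july", "august")


def seasonal_features(row):
    """Build the aggregates of Equations (4)-(13) for one (year, district) row.

    `row` is a dict with keys t_april..t_august and p_april..p_august.
    """
    t = [row[f"t_{m}"] for m in MONTHS]
    p = [row[f"p_{m}"] for m in MONTHS]
    return {
        "t_early_mean": (t[0] + t[1]) / 2,    # Eq. (4)
        "t_mid_mean": (t[2] + t[3]) / 2,      # Eq. (5)
        "t_late_mean": t[4],                  # Eq. (6)
        "t_season_mean": sum(t) / len(t),     # Eq. (7)
        "t_season_range": max(t) - min(t),    # Eq. (8)
        "p_early_sum": p[0] + p[1],           # Eq. (9)
        "p_mid_sum": p[2] + p[3],             # Eq. (10)
        "p_late_sum": p[4],                   # Eq. (11)
        "p_season_sum": sum(p),               # Eq. (12)
        "p_season_max": max(p),               # Eq. (13)
    }


# Hypothetical monthly values for one record.
example = {
    "t_april": 6.0, "t_may": 14.0, "t_june": 20.0, "t_july": 22.0, "t_august": 19.0,
    "p_april": 20.0, "p_may": 30.0, "p_june": 40.0, "p_july": 50.0, "p_august": 25.0,
}
features = seasonal_features(example)
```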

Only variables strictly related to the growing season were used in statistical analyses and model training. The dataset includes the Year and ID identifiers (for the panel structure and correct temporal partitioning), climatic predictors for the months April–August (t_april–t_august, p_april–p_august), and aggregated seasonal indicators: temperature features (t_early_mean, t_mid_mean, t_late_mean, t_season_mean, t_season_range) and precipitation features (p_early_sum, p_mid_sum, p_late_sum, p_season_sum, p_season_max). Additionally, we used the agronomic predictor audan_l (summer surveyed area), the main target variable $\mathrm{zasel}_{l}$ (summer infestation index), and the auxiliary indicator kol_l for extended descriptive analysis. This data structure provides a biologically sound, transparent, and reproducible sampling protocol for predicting pest infestations throughout the growing season.

3.2. Hyperparameters and Model Configurations

To ensure a fair comparison of models and prevent arbitrary configuration bias, hyperparameters for all machine learning models were selected using a structured cross-validation procedure. For tabular models (Ridge, Random Forest, and XGBoost), a predefined hyperparameter grid was constructed based on generally accepted ranges reported in the literature, as well as a preliminary sensitivity analysis conducted on the training data. Model selection was conducted using 8-fold time-based cross-validation, in which each fold corresponded to a held-out test year to preserve the temporal structure of the panel dataset. The primary selection criterion was the mean validation root mean square error (RMSE), calculated across all folds. Additionally, the generalization gap in R², defined as the difference between the R² values on the training and validation sets, was monitored to assess overfitting and model stability. The final hyperparameter configuration for each model was chosen as the one that yielded the lowest average validation RMSE while demonstrating consistent performance across all folds and a minimal generalization gap in R². This strategy targeted not only accuracy but also robustness of the selected models to the temporal variability of agroclimatic conditions. The resulting hyperparameter values reflect a tradeoff between expressive power and generalization stability, which is particularly important for panel agroclimatic data characterized by limited temporal depth and interannual variability.
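The selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: the per-fold result layout, the configuration names, the toy numbers, and the tie-break on the mean R² gap are all hypothetical, since the text does not specify how the RMSE criterion and the stability checks were combined.

```python
from statistics import mean, pstdev


def select_config(cv_results):
    """Pick the configuration with the lowest mean validation RMSE.

    `cv_results` maps a config name to a list of per-fold dicts with keys
    'val_rmse', 'train_r2', 'val_r2'. Ties on RMSE are broken by the smaller
    mean R2 gap; this tie-break rule is an assumption of the sketch.
    """
    summary = {}
    for name, folds in cv_results.items():
        rmses = [f["val_rmse"] for f in folds]
        gaps = [f["train_r2"] - f["val_r2"] for f in folds]
        summary[name] = {
            "mean_rmse": mean(rmses),
            "rmse_spread": pstdev(rmses),  # fold-to-fold consistency
            "mean_r2_gap": mean(gaps),     # generalization gap in R2
        }
    best = min(
        summary,
        key=lambda n: (summary[n]["mean_rmse"], summary[n]["mean_r2_gap"]),
    )
    return best, summary


# Hypothetical results for two candidate configurations over two folds.
cv_results = {
    "config_a": [
        {"val_rmse": 0.05, "train_r2": 0.60, "val_r2": 0.50},
        {"val_rmse": 0.07, "train_r2": 0.62, "val_r2": 0.48},
    ],
    "config_b": [
        {"val_rmse": 0.06, "train_r2": 0.90, "val_r2": 0.40},
        {"val_rmse": 0.08, "train_r2": 0.92, "val_r2": 0.38},
    ],
}
best, summary = select_config(cv_results)
```

Reporting the spread and the gap alongside the mean RMSE makes it possible to reject a configuration that wins on average but is unstable across folds.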

The hyperparameter optimization procedure described above yielded a set of stable and well-generalizing configurations for each tabular model. After evaluating all parameter combinations in the grid using time-based cross-validation, the best configuration for each model was selected according to the predefined selection criteria. Table 1 lists the final hyperparameter values used in the comparative experiments.

As shown in Table 1, the selected configurations reflect a stable balance between model complexity and generalization stability. For ridge regression, a low regularization strength is preferred, indicating that a mild L2 penalty is sufficient to control multicollinearity without suppressing informative climate predictors. For random forest, the optimal structure combines a sufficient number of trees with feature subsampling and leaf-level regularization, reducing variance while maintaining nonlinear modeling ability. For XGBoost, the selected configuration emphasizes conservative boosting with a low learning rate and shallow trees, ensuring gradual error minimization and limiting overfitting across observation years. Overall, the selected hyperparameters show that model stability and temporal robustness were prioritized over structural complexity, supporting the validity of the subsequent performance benchmarking.

A summary of the key hyperparameters and structural features of the models used is given in Table 2. All neural network models used the Adam optimizer with adaptive learning rates, early stopping based on validation MSE, and small mini-batch sizes to account for the limited nature of the annual series.

Taken together, the presented hyperparameters and architectural solutions ensure comparable training conditions across all models and provide a unified, reproducible technical basis for subsequent comparisons. The use of standardized features, the Adam optimizer, an early stopping mechanism, fixed random seeds, and unified input data structures reduces the influence of configuration artifacts on the final metrics and allows the contribution of each architecture to forecast quality to be evaluated objectively. This approach supports the correctness of the subsequent experimental analysis and a transparent interpretation of the differences observed between linear, ensemble, recurrent, and hybrid models.

3.3. Experimental Design and Model Training

To ensure methodological transparency and prevent temporal data leakage, the dataset was split using a strictly chronological partitioning strategy. Because the data form a panel of annual observations from 2005 to 2022, random shuffling was not applied. For the main experimental evaluation, the dataset was divided into training and test subsets in time order. In each validation iteration, one year was held out as the test set, and all previous years were used for model training. This rolling-origin evaluation scheme preserves the natural time dependence of agroclimatic processes and reflects realistic forecasting conditions. For the tabular models, 8-fold time-based cross-validation was implemented, with each fold corresponding to a separate held-out test year.

A fixed chronological partition was used for diagnostic analysis of the data distribution, visualization of forecasting behavior, and final out-of-time evaluation. In this configuration, the period 2005–2018 was used for training (3500 observations over 14 years and 250 regions), 2019–2020 for validation (500 observations over 2 years and 250 regions), and 2021–2022 for final testing (500 observations over 2 years and 250 regions). The neural and hybrid time series models were trained using sliding-window sequences constructed only from past observations, ensuring that future information was not accessed during model fitting. This protocol ensures that the model evaluation reflects out-of-sample forecasting performance under realistic conditions and eliminates information leakage between training and testing periods. To further illustrate the strictly chronological validation strategy, Table 3 provides a brief description of the rolling-origin cross-validation protocol used in this study. Each iteration corresponds to one held-out test year, while all previous years are used exclusively for model training.

As shown in Table 3, the training set always consists of historical observations preceding the test year. Future information was not considered when constructing the model. This design corresponds to real-world forecasting scenarios, where forecasts for a given year can be based only on past climate and phytosanitary data.
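The chronological protocol can be sketched in a few lines. The fixed partition boundaries (2005–2018, 2019–2020, 2021–2022) and the 250 regions per year come from the text; the choice of the last eight years as rolling-origin test years is an assumption of this sketch, because the exact test years of Table 3 are not restated here.

```python
def rolling_origin_folds(years, n_folds):
    """Yield (train_years, test_year) pairs with an expanding training window.

    The last `n_folds` years each serve once as the test year. Which years
    the study actually held out is an assumption of this sketch.
    """
    years = sorted(years)
    for test_year in years[-n_folds:]:
        train_years = [y for y in years if y < test_year]
        yield train_years, test_year


def fixed_partition(year):
    """Fixed chronological split: 2005-2018 / 2019-2020 / 2021-2022."""
    if year <= 2018:
        return "train"
    if year <= 2020:
        return "validation"
    return "test"


years = list(range(2005, 2023))
folds = list(rolling_origin_folds(years, n_folds=8))

# Count observations per subset, with 250 regions per year as reported.
counts = {"train": 0, "validation": 0, "test": 0}
for y in years:
    counts[fixed_partition(y)] += 250
```

Because every training set is built only from years strictly earlier than the test year, no fold can see future observations.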

A combination of descriptive statistics and correlation analyses, post hoc SHAP interpretation, and comparative testing of 10 models confirms that summer pest infestations are determined by a combination of early-spring temperature conditions, moderate moisture anomalies, and complex interannual interactions among climatic factors. Hybrid architectures combining wavelet decomposition, recurrent layers, and attention mechanisms achieve significant improvements in accuracy over traditional models, fully reproducing both background and extreme index dynamics. These results indicate high potential for implementing the proposed architecture in intelligent monitoring and early warning systems.

Table 4 presents the final performance evaluation results after fixing the model parameters on three predefined chronological subsets: training, validation, and test. The results show that classical ensemble methods such as Random Forest and XGBoost demonstrate very high fitting accuracy on the training subset but generalize poorly to the validation and test datasets, indicating significant overfitting under small-sample temporal conditions. In contrast, the baseline MLP model demonstrates the most stable generalization ability, achieving the best results on the test data (MSE = 0.0013, RMSE = 0.0355, MAE = 0.0302, R² = 0.7367). Among the wavelet-based hybrid architectures, Wavelet_CNN_SARIMA achieves the highest validation accuracy (R² = 0.8790), while Wavelet_CNN_GRU demonstrates the most balanced behavior on the test data among the wavelet-based models (R² = 0.0200), although its predictive power remains lower than that of the baseline MLP model on the final test subset. Wavelet-based models use a 3-year input window, so the effective number of training years available to them is 11 rather than 14.

Table 5 presents the Ljung–Box test statistics and p-values for the SARIMA residuals across multiple lags. In all cases, the p-value remains well above the standard significance threshold of 0.05 (0.539, 0.202, 0.310, 0.434), so the null hypothesis of no residual autocorrelation cannot be rejected. This means that after fitting the SARIMA model, the residuals do not retain a statistically significant linear time dependence across the tested lags, and the SARIMA component adequately captures the linear portion of the signal. Therefore, further quality improvements in the hybrid architecture should come primarily from the nonlinear modules (CNN/GRU/Transformer), which model the remaining complex dynamics and interannual effects.
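For reference, the Ljung–Box statistic behind such a table is $Q = n(n+2)\sum_{k=1}^{h} \hat{\rho}_k^{2}/(n-k)$, where $\hat{\rho}_k$ is the lag-$k$ sample autocorrelation of the residuals. The sketch below computes only $Q$ in plain Python; obtaining a p-value additionally requires the chi-square distribution with $h$ degrees of freedom (optionally reduced by the number of fitted parameters). The toy residual series is hypothetical and deliberately autocorrelated.

```python
def ljung_box_q(residuals, max_lag):
    """Return the Ljung-Box Q statistic for lags 1..max_lag."""
    n = len(residuals)
    mu = sum(residuals) / n
    centered = [r - mu for r in residuals]
    denom = sum(c * c for c in centered)
    q = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k.
        rho_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q


# Strongly alternating toy residuals (hypothetical), 18 annual values.
toy_residuals = [1.0, -1.0] * 9
q_lag1 = ljung_box_q(toy_residuals, max_lag=1)

# 5% critical value of the chi-square distribution with 1 degree of freedom.
CHI2_CRIT_DF1 = 3.841
rejects_white_noise = q_lag1 > CHI2_CRIT_DF1
```

For this alternating series the statistic far exceeds the critical value, so the no-autocorrelation hypothesis would be rejected; the SARIMA residuals reported in Table 5 show the opposite behavior.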

As part of the experimental phase, several classes of models for predicting the summer pest infestation index $\mathrm{zasel}_{l}$ were trained and compared: linear regression-based models, tree-based ensembles, a fully connected neural network (MLP), and a family of hybrid wavelet-temporal models combining classical ARIMA/SARIMA approaches with convolutional and recurrent neural networks and a spatio-temporal Transformer block. All models solved a univariate (single-target) regression problem: for each feature vector $x_{i}$ (or sequence $X_{t}$ for temporal models), the task was to estimate the conditional expectation of the target variable $y_{i} = \mathrm{zasel}_{l}$. The root-mean-square error was used as the loss function in all cases, ensuring model comparability based on minimizing the variance of the forecast error. For tabular models, the initial features were standardized using Z-score scaling, and for stochastic algorithms, the random number generator seed was fixed for experimental reproducibility.
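The evaluation metrics and the Z-score scaling mentioned above can be written out explicitly. The following is a minimal sketch; fitting the scaling statistics on the training subset only is an assumption made here to stay consistent with the leakage-free protocol, and the toy values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and the coefficient of determination R2."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}


def zscore_fit(train_column):
    """Estimate mean and (population) standard deviation on training data."""
    n = len(train_column)
    mu = sum(train_column) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in train_column) / n)
    return mu, sigma


def zscore_apply(column, mu, sigma):
    """Scale any subset with statistics estimated on the training data."""
    return [(v - mu) / sigma for v in column]


# Hypothetical targets and predictions.
metrics = regression_metrics([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.5])

mu, sigma = zscore_fit([1.0, 2.0, 3.0])
scaled_test = zscore_apply([2.0, 4.0], mu, sigma)
```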

Linear regression and Ridge regression served as simple, interpretable benchmarks. Both models approximated the relationship between a $p$-dimensional feature vector $x_{i} \in \mathbb{R}^{p}$ and a target variable $y_{i}$ as (14):

$$\hat{y}_{i} = \beta_{0} + x_{i}^{\top} \beta \tag{14}$$

where $\beta_{0}$ is the intercept and $\beta$ is the vector of coefficients. Parameter estimation in ordinary linear regression was performed using the least squares method. For Ridge regression, the following problem was solved:

$$\min_{\beta_{0}, \beta} \frac{1}{N} \sum_{i=1}^{N} \left(y_{i} - \beta_{0} - x_{i}^{\top} \beta\right)^{2} + \lambda \lVert \beta \rVert_{2}^{2} \tag{15}$$

where $\lambda = \alpha$ is the L2 regularization coefficient, which suppresses overfitting by penalizing large weights.

Random forest is a tree-based ensemble, and XGBoost is a gradient-boosting method. In the random forest, the final prediction was calculated as an average over a set of $M$ independent decision trees:

$$\hat{y}(x) = \frac{1}{M} \sum_{m=1}^{M} T_{m}(x) \tag{16}$$

where $T_{m}$ is the $m$-th tree trained on a bootstrap sample using a random subset of features at each node. The XGBoost model was the sum of sequentially added trees:

$$\hat{y}_{i}^{(K)} = \sum_{k=1}^{K} f_{k}(x_{i}) \tag{17}$$

where each tree $f_{k}$ was fitted in the direction of the negative gradient of the loss function, with L2 regularization on the leaves.

The fully connected neural network (MLP) implemented a nonlinear mapping $\hat{y}_{i} = f_{\theta}(x_{i})$, where $\theta$ denotes the layer parameters. The architecture included three hidden fully connected layers with ReLU activations and Dropout regularization, followed by a linear output neuron for regression.

Hybrid models were run with an annual time series of Kazakhstan’s average

z a s e l_{l}

index values, denoted by

y_{t}

,

t = 1, \dots, T

(years from 2005 to 2022). To extract multiscale components, a discrete wavelet transform with the Daubechies-4 (db4) wavelet and a decomposition level

L = 1

was used. As a result, the original signal

{y_{t}}

is represented as (18):

y_{t} = a_{t} + d_{t}

(18)

where

a_{t}

is the smoothed (approximating) component,

d_{t}

and is the detailed high-frequency component. Next, a three-channel feature

s_{t} = [y_{t}, a_{t}, d_{t}]^{⊤}

with a sliding window of length

q = 3

years was formed:

X_{t} = [s_{t - q}, s_{t - q + 1}, \dots, s_{t - 1}] \in R^{q \times 3}, t = q + 1, \dots, T

(19)
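The construction in (18) and (19) can be sketched in a few lines. For brevity, this illustration swaps in the Haar wavelet, whose level-1 components are simple pair means, in place of the db4 wavelet used in the paper; the helper names and the synthetic series are hypothetical, and the additive property $y_{t} = a_{t} + d_{t}$ is the only part carried over directly.

```python
def haar_level1(y):
    """Split an even-length series into approximation a and detail d with y = a + d."""
    if len(y) % 2:
        raise ValueError("this sketch expects an even number of points")
    a, d = [], []
    for k in range(0, len(y), 2):
        pair_mean = (y[k] + y[k + 1]) / 2.0
        for value in (y[k], y[k + 1]):
            a.append(pair_mean)           # smoothed component a_t
            d.append(value - pair_mean)   # high-frequency component d_t
    return a, d

def sliding_windows(y, a, d, q=3):
    """Build X_t = [s_{t-q}, ..., s_{t-1}] with s_t = [y_t, a_t, d_t], plus the target y_t."""
    s = [[yt, at, dt] for yt, at, dt in zip(y, a, d)]
    windows = [s[t - q:t] for t in range(q, len(y))]
    targets = [y[t] for t in range(q, len(y))]
    return windows, targets

# 18 synthetic "annual" values standing in for 2005-2022 (not the paper's data).
y = [float(v) for v in range(1, 19)]
a, d = haar_level1(y)
X, targets = sliding_windows(y, a, d, q=3)
```

With $T = 18$ and $q = 3$, this yields 15 windows of shape $3 \times 3$, which indicates how few training samples the hybrid models see at the annual scale.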

Wavelet_CNN_ARIMA and Wavelet_CNN_SARIMA. First, a baseline ARIMA (1, 0, 0) or SARIMA (1, 0, 0)

\times {(1, 0, 0)}_{s = 3}

model was trained on the

\{y_{t}\}

series, providing linear modeling of the autocorrelation structure. Let

{\hat{y}}_{t}^{base}

be the model’s prediction. Next, the residuals were constructed:

r_{t} = y_{t} - {\hat{y}}_{t}^{base}

(20)

which were approximated by a convolutional network

g_{ϕ} (\cdot)

on wavelet features

X_{t}

. The final hybrid forecast had the form:

{\hat{y}}_{t} = {\hat{y}}_{t}^{base} + g_{ϕ} (X_{t})

(21)

The convolutional block included two Conv1D layers (32 and 16 filters, kernel 2, ReLU, “same” padding), intermediate max-pooling, a Flatten layer, a fully connected layer of 32 neurons with ReLU and Dropout, and a linear output neuron for residual prediction.
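The residual scheme in (20) and (21) can be illustrated with a minimal sketch. It assumes a conditional least-squares fit of an AR(1) model as a stand-in for the ARIMA(1, 0, 0) base (library implementations typically use maximum likelihood), and it replaces the convolutional network $g_{ϕ}$ with a placeholder; all function names and the toy series are hypothetical.

```python
def fit_ar1(y):
    """Least-squares fit of y_t = c + phi * y_{t-1} (a conditional AR(1) estimate)."""
    prev, curr = y[:-1], y[1:]
    n = len(prev)
    mean_prev, mean_curr = sum(prev) / n, sum(curr) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (v - mean_curr) for p, v in zip(prev, curr))
    phi = sxy / sxx
    return mean_curr - phi * mean_prev, phi

def base_forecasts(y, c, phi):
    """One-step-ahead base predictions for every point after the first."""
    return [c + phi * y_prev for y_prev in y[:-1]]

def hybrid_forecast(base, correction):
    """Equation (21): base forecast plus the learned residual correction."""
    return [b + g for b, g in zip(base, correction)]

# Noise-free toy series generated by y_t = 1 + 0.5 * y_{t-1} (hypothetical data).
y = [0.0, 1.0, 1.5, 1.75, 1.875]
c, phi = fit_ar1(y)
base = base_forecasts(y, c, phi)
residuals = [obs - b for obs, b in zip(y[1:], base)]   # equation (20)
# Placeholder for g_phi(X_t): here the "network" simply returns the true residuals.
final = hybrid_forecast(base, residuals)
```

In the actual pipeline the correction term would come from the network trained on the wavelet windows $X_{t}$, so the final forecast matches the observations only as well as the residuals can be learned.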

Wavelet-CNN-LSTM and Wavelet-CNN-GRU. These models did not use an ARIMA/SARIMA basis; the neural network directly mapped the

X_{t}

sequence to the target prediction:

{\hat{y}}_{t} = f_{θ} (X_{t})

(22)

The same two-layer Conv1D block was used as the frontend, followed by a recurrent LSTM or GRU layer with 32 units and dropout, and then a linear regression output. This design allows for simultaneous consideration of local wavelet patterns and long-term temporal dependencies.

Wavelet_CNN_SARIMA_GRU_STTransformer (multi-branch hybrid). For the most complex architecture, a residual learning scheme with respect to the SARIMA basis was used:

{\hat{y}}_{t} = {\hat{y}}_{t}^{SARIMA} + h_{ψ} (X_{t})

(23)

where

h_{ψ}

is a multi-branch neural network block consisting of three parallel branches:

(i): A CNN branch similar to Wavelet_CNN_SARIMA;
(ii): A CNN–GRU branch, supplementing convolutions with a GRU layer to capture sequential patterns;
(iii): A spatial-temporal Transformer block, including a dense projection into a $d_{model} = 64$-dimensional space, a multi-head attention layer, and a two-layer position-wise feed-forward block with L2 regularization.

The vector representations from the three branches were concatenated and fed into a two-layer fully connected “head” block (ReLU + Dropout), which generates the residual prediction $h_{ψ} (X_{t})$. This architecture combines the advantages of the linear seasonal SARIMA basis and nonlinear feature extraction from a wavelet sequence.
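The attention layer in branch (iii) builds on scaled dot-product attention. The following minimal single-head sketch shows that core operation only; it omits the learned projections, multiple heads, normalization, and feed-forward block described above, and the toy embeddings are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over a short sequence."""
    d_k = len(keys[0])
    output, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output step is a weighted average of the value vectors.
        output.append([sum(wj * v[c] for wj, v in zip(w, values))
                       for c in range(len(values[0]))])
    return output, weights

# Three time steps (q = 3) with 2-dimensional toy embeddings; self-attention uses Q = K = V.
steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(steps, steps, steps)
uniform_out, _ = attention([[1.0, 2.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0], [3.0]])
```

Each row of `w` sums to one, so every output step is a convex combination of the three lagged feature vectors in the window.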

Figure 1 shows the organization of the proposed multi-branch hybrid architecture for processing the wavelet-transformed sequence of annual infestation index values,

wavelet_seq \in R^{T \times F}

. Branch 1 (Wavelet_CNN_SARIMA) implements a convolutional extractor of local temporal patterns: two Conv1D layers with a

k = 2

kernel. Subsequent global average pooling (GAP) forms a compact representation, which is transformed through fully connected layers (Dense 32 → Dropout 0.2 → Dense 16) into a feature vector

b1_out

, consistent with the residuals of the base SARIMA model. The second branch (Wavelet_CNN_GRU) takes the same wavelet sequence as input and, after the Conv1D–MaxPool–Conv1D block, uses a recurrent GRU (32) layer to capture longer temporal dependencies. Its output is further regularized by a Dropout layer and compressed into a 16-dimensional vector (Dense 16), forming the representation

b2_out

. The third branch implements a spatiotemporal Transformer: a linear projection to dimension

d_{model}

, a normalization layer, a MultiHeadAttention block, and a position-wise feed-forward block with L2 regularization and Dropout create an aggregated description of the series dynamics, which, after global time averaging and two fully connected layers (Dense 32 → Dropout → Dense 16), yields the vector

st_out

.

The lower part of the diagram shows the feature fusion block: the vectors

b1_out

,

b2_out

, and

st_out

are concatenated and sequentially passed through two fully connected layers with ReLU activation (32 and 16 neurons) and a Dropout layer. The final Dense (1, linear) layer produces a residual prediction

{\hat{r}}_{t}

, which is then combined with the base SARIMA modeling trajectory. Thus, the depicted architecture combines local convolutional features, recurrent sequence modeling, and the Transformer attention mechanism, enabling more complete exploitation of the time series’ latent structure for pest infestation index analysis and prediction.

4. Results

4.1. Statistical and Visual Analysis of Data Structure

To quantitatively assess the relationship between agroclimatic conditions and pest infestation in grain crops, a panel dataset was created, with the basic unit of observation being the “year–spatial unit” pair. For each year and region (or agrounit), average monthly temperature and precipitation values were recorded for the period April–August, the area of surveyed crops in summer, as well as summer phytosanitary indicators, including the infestation index and pest abundance. This data format allows for the simultaneous consideration of interannual dynamics and the spatial heterogeneity of pest damage and serves as the basis for subsequent statistical analysis and the training of predictive models.

Table 6 shows that all climate indicators are generated strictly for the growing season and include only the months from April to August. Autumn indicators are absent, as they reflect the pest population's preparation for overwintering and influence the dynamics of the following year, but are not informative for predicting current summer infestation levels. The target variables for the summer period are the infestation index

zasel_l

, which serves as the primary target for modeling, and the kol_l indicator, used for descriptive analysis and biological interpretation. The additional indicator audan_l accounts for the influence of the surveyed area on the recorded infestation level, thereby increasing the accuracy of subsequent analysis.

Each row corresponds to a unique combination of observation year (Year) and spatial identifier (ID). For each pair (Year, ID), the average monthly temperature (t_*) and precipitation (p_*) for April–August, the area of surveyed crops in summer (audan_l), the summer infestation index (zasel_l), and the pest abundance (kol_l) are presented. The first five observations are shown; the full dataset contains 4500 rows.

Table 7 shows the number of observations (count), mean (mean), standard deviation (std), minimum value (min), quartile values (25%, 50%, 75%), and maximum (max) for each variable. This describes the range and variability of agroclimatic conditions and phytosanitary indicators used in the modeling.
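The summary layout of Table 7 can be reproduced for any single variable with the standard library. The sketch below assumes the common convention of a sample standard deviation and linearly interpolated quartiles; the paper does not state which convention was used, and the input values are toy numbers rather than the study data.

```python
import statistics

def describe(values):
    """Summary statistics in the layout of Table 7 (count, mean, std, min, quartiles, max)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),   # sample standard deviation (n - 1 in the denominator)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

summary = describe([1.0, 2.0, 3.0, 4.0, 5.0])   # toy values, not the paper's data
```

For a right-skewed variable such as the infestation index, a large gap between the 75% quartile and the maximum in this summary is the first sign of rare extreme outbreaks.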

The data cover 18 years (2005–2022) and include approximately 4500 observations, which provide a reliable assessment of interannual dynamics. Average monthly temperatures (t_april–t_august) demonstrate the expected seasonal profile for Kazakhstan: spring values are in the range of ~15–21 °C, and the summer maximum is reached in July (~27 °C) with moderate interannual variability. Precipitation indices (p_april–p_august) vary mainly within the range [0, 1], reflecting significant fluctuations in humidity between months, which is essential for the formation of favorable or unfavorable conditions for pest development. The surveyed area indicator (audan_l) varies widely, from 0.001 to 141.367, indicating significant differences in the monitoring scale across regions and years, and is therefore used as one of the predictors. The distributions of the target variables zasel_l and kol_l exhibit pronounced right skewness: the medians remain relatively low, while the maxima can be very high, indicating rare but intense pest outbreaks. This skewness is taken into account when choosing models and analyzing predictive errors.

To ensure the validity of the evaluation protocol across time and to check for unintended distribution shifts, descriptive statistics were calculated separately for the training, validation, and test sets. Because the dataset was split chronologically, differences between splits may reflect genuine interannual variability rather than random selection effects. Therefore, statistical comparisons across splits provide an additional diagnostic layer for assessing the consistency of the dataset and the generalization of the model. Table 8 presents summary statistics for the target variable and key climate predictors across the three data partitions. The comparison confirms that the climate predictors remain largely comparable across sample splits. Average temperature values for April–June exhibit only moderate fluctuations, and precipitation indices maintain similar ranges and scatter, indicating a stable distribution of climate parameters throughout the observation period. In contrast, the target variable (zasel_l) exhibits noticeable differences in extreme values. Although the median remains virtually identical across sample splits (≈0.5), the training set contains significantly higher maximum values compared to the validation and testing periods. This suggests that rare but intense outbreaks were observed in earlier years, which are less common in later seasons.

Given the central role of early- and mid-season agroclimatic conditions in shaping summer pest dynamics, a targeted statistical analysis of key climatic factors was conducted for the period from April to June. These months correspond to critical phases of the early growing season and early biological activity, during which temperature and humidity conditions can significantly influence pest development and survival rates. Table 9 presents descriptive statistics for the selected climatic variables, allowing for a structured assessment of their central tendencies, variability, and observed ranges across the entire panel dataset.

In addition to monthly temperature and precipitation indices (April–August), aggregated seasonal characteristics were constructed to reflect conditions at key stages of the growing season. For temperature, mean values were calculated for the early (

t_{early_mean}

), middle (

t_{mid_mean}

), and late (

t_{late_mean}

) growing-season subsegments, as well as the mean temperature for the entire period April–August (

t_{season_mean}

) and its intraseasonal range (

t_{season_range}

). Similarly, for precipitation, summary indices were obtained for the early (

p_{early_sum}

), middle (

p_{mid_sum}

), and late (

p_{late_sum}

) parts of the season, the total precipitation for April–August (

p_{season_sum}

), and the maximum monthly value within the season (

p_{season_max}

). These aggregates provide a compact description of integral and extreme climatic conditions that may affect pest development and survival and are used in subsequent regression and machine learning analyses as predictors of summer infestation.
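A minimal sketch of how such seasonal aggregates could be derived from one monthly record is given below. The assignment of months to the early, middle, and late sub-seasons is a hypothetical choice made for illustration only, since the exact boundaries are not stated here; the record itself is synthetic.

```python
MONTHS = ["april", "may", "june", "july", "august"]
# Hypothetical grouping of months into sub-seasons; the exact boundaries are an assumption.
PARTS = {"early": ["april", "may"], "mid": ["june", "july"], "late": ["august"]}

def seasonal_aggregates(row):
    """Derive seasonal features from one record with t_<month> and p_<month> keys."""
    temps = [row[f"t_{m}"] for m in MONTHS]
    precs = [row[f"p_{m}"] for m in MONTHS]
    feats = {
        "t_season_mean": sum(temps) / len(temps),
        "t_season_range": max(temps) - min(temps),   # intraseasonal temperature range
        "p_season_sum": sum(precs),
        "p_season_max": max(precs),                  # wettest month of the season
    }
    for part, months in PARTS.items():
        feats[f"t_{part}_mean"] = sum(row[f"t_{m}"] for m in months) / len(months)
        feats[f"p_{part}_sum"] = sum(row[f"p_{m}"] for m in months)
    return feats

# Synthetic record (not taken from the dataset).
row = {"t_april": 10.0, "t_may": 16.0, "t_june": 20.0, "t_july": 24.0, "t_august": 20.0,
       "p_april": 0.2, "p_may": 0.4, "p_june": 0.1, "p_july": 0.3, "p_august": 0.0}
features = seasonal_aggregates(row)
```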

Table 10 shows that the mean seasonal temperature

t_{season_mean}

is within a relatively narrow range (22–24 °C, with a moderate standard deviation), while the intraseasonal temperature range

t_{season_range}

exhibits significantly higher variability, reflecting years with abnormally cold spring months or hot summer months. The following columns correspond to these features: temperature aggregates—

t_{early_mean}

,

t_{mid_mean}

,

t_{late_mean}

,

t_{season_mean}

,

t_{season_range}

, and precipitation aggregates—

p_{early_sum}

,

p_{mid_sum}

,

p_{late_sum}

,

p_{season_sum}

,

p_{season_max}

. All the listed derived variables were used both in descriptive and correlation analyses (to study distributions and their relationships with the

{z a s e l}_{l}

indicator) and as additional input features in constructing models to predict the summer infestation index.

The total precipitation indices

p_{early_sum}

,

p_{mid_sum}

, and

p_{season_sum}

indicate a noticeable spread between relatively dry and wet seasons. At the same time, the

p_{season_max}

indicator characterizes the intensity of the maximum monthly precipitation episode within the growing season. Together, these derived indicators allow for a more compact and biologically meaningful incorporation of the seasonal climate background into models for predicting summer pest infestations.

As part of the spatial data analysis, long-term variability in the summer pest infestation index between administrative districts was assessed. For each district, the average summer pest infestation index was calculated for 2005–2022, and the 15 districts with the highest values were selected. This ranking allows us to identify persistent “hot spots” of phytosanitary risk and to verify the extent to which the training and validation samples cover the full range of spatial conditions, including both high- and low-infestation areas.
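The district ranking described above amounts to a grouped mean followed by a sort. A minimal sketch, using synthetic records and hypothetical district labels rather than the monitoring data, might look as follows.

```python
from collections import defaultdict

def top_districts(records, k=15):
    """Rank districts by their long-term mean infestation index (highest first)."""
    by_district = defaultdict(list)
    for district, value in records:
        by_district[district].append(value)
    means = {name: sum(vals) / len(vals) for name, vals in by_district.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:k]

# Synthetic (district, zasel_l) pairs with hypothetical district labels.
records = [("A", 1.0), ("A", 3.0), ("B", 0.5), ("B", 0.5), ("C", 4.0)]
ranking = top_districts(records, k=2)
```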

Figure 2 shows the districts with the highest average summer infestation index

zasel_l

, aggregated across the entire study period. The leading regions (Esil, Zhaksy, Albasar, Kostanay, and Osakarovka) form a group of territories with significantly higher infestation levels, while areas on the right side of the diagram (e.g., Balkashino and Astana) demonstrate significantly lower average index values.

The bar labels show the exact long-term averages, highlighting the severalfold differences between individual districts. This spatial ranking is used both to interpret biogeographic patterns of pest infestation and to subsequently analyze the robustness of forecasting models to varying levels of pest pressure.

To assess the statistical properties of the target variable, a primary univariate analysis of the summer pest infestation index

zasel_l

was conducted. Particular attention was paid to the distribution shape and the presence of extreme values, as these features determine the choice of quality metrics, the robustness of regression models, and the need for transformations or robust methods. Figure 3 shows the distribution of

zasel_l

index values as a boxplot, where the central box represents the interquartile range (from the 25th to the 75th percentile), the horizontal line inside represents the median, and the whiskers represent the “typical” range of observations. As shown in the graph, most values are concentrated near the lower end of the scale. At the same time, numerous outliers lie above the upper whisker, including isolated extreme outbreaks with index values above 70–100. This pronounced right skewness and heavy upper tail indicate rare but very intense pest outbreaks, which we interpret as genuine episodes of high pest pressure rather than measurement artifacts. This result was taken into account when selecting the combination of metrics (MSE, RMSE, MAE, and

R^{2}

) and when developing hybrid models capable of adequately reproducing both the “background” level of infestation and rare extreme events.

To quantify the contribution of agroclimatic factors to variation in the summer pest infestation index, a linear correlation analysis was first performed. Pearson correlation coefficients were calculated for the training set between average monthly temperature and relative moisture indices in April–August and the target variable,

zasel_l

. The resulting values were used as the primary criterion for feature significance in the subsequent construction and interpretation of hybrid models. Figure 4 shows horizontal bar graphs displaying the Pearson correlation coefficients between average monthly temperatures (upper panel) and precipitation indices (lower panel) for April–August and the

zasel_l

index.
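For reference, the Pearson coefficient used in this analysis can be computed directly from its definition. The sketch below uses a hypothetical four-point sample in which warmer Aprils coincide with lower index values; it is not drawn from the study data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical values: April temperature and the summer infestation index.
t_april = [12.0, 14.0, 16.0, 18.0]
zasel = [3.0, 2.5, 1.0, 0.5]
r = pearson(t_april, zasel)   # negative: higher April temperature, lower index
```

Because the coefficient captures only linear association, a value near zero for a given month does not rule out a nonlinear response.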

Temperature indicators exhibit predominantly negative relationships with the target variable; the highest absolute correlation is observed for April temperature, underscoring the critical role of early spring conditions in subsequent pest infestations during the growing season. May and July temperatures are also associated with the infestation index, but to a somewhat lesser extent, while the influence of June temperature is close to zero, indicating a more complex, possibly nonlinear, infestation response in mid-season.

Precipitation predictors are characterized by a relatively weak negative correlation with

zasel_l

: July precipitation makes the most significant contribution, while humidity indicators in other months have only a modest moderating effect on infestation density. Taken together, these patterns confirm that early spring temperatures are the leading linearly related risk factor, while precipitation plays a weaker but systematic modifying role.

Figure 5 shows how the XGBoost model redistributes “weight” between monthly climate features: the top panel shows temperature predictors, and the bottom panel shows humidity indicators. The diagram clearly demonstrates that among the temperatures, April temperature makes the most significant contribution to forecast quality, followed by June and August, while May and July play a more supporting role.

In the precipitation group, June has the highest importance, followed by April and August, with May and July precipitation being less significant. This importance distribution suggests that the model relies primarily on early spring temperature background and early summer moisture conditions to reconstruct the interannual variability of the

zasel_l

index, while the other months contribute additional, but less critical, information. Figure 6 shows how the mutual information between

zasel_l

and the climate indicators is distributed across months during the growing season. The top panel shows that the most information about the target index is contained in April and May temperatures, while the contribution of summer temperatures (especially June and August) is significantly lower, indicating the critical role of early spring thermal conditions in determining subsequent pest infestations.
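Mutual information, unlike the Pearson coefficient, also responds to nonlinear dependence. The estimator used for Figure 6 is not specified here, so the sketch below swaps in a simple binned plug-in estimate as an assumption; the helper names, the number of bins, and the toy values are hypothetical.

```python
import math
from collections import Counter

def mutual_information(x_bins, y_bins):
    """Plug-in estimate of I(X; Y) in nats for two already-discretised variables."""
    n = len(x_bins)
    joint = Counter(zip(x_bins, y_bins))
    px, py = Counter(x_bins), Counter(y_bins)
    mi = 0.0
    for (xv, yv), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts
        mi += p_xy * math.log(p_xy * n * n / (px[xv] * py[yv]))
    return mi

def equal_width_bins(values, n_bins=3):
    """Assign each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant series
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical values: a monthly temperature and the infestation index.
temps = [10.0, 11.0, 15.0, 16.0, 20.0, 21.0]
index = [5.0, 5.5, 3.0, 2.8, 1.0, 0.9]
mi = mutual_information(equal_width_bins(temps), equal_width_bins(index))
```

Binned estimates depend on the bin count and are biased on small samples, so values obtained this way are best compared across features rather than read as absolute quantities.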

The bottom panel shows the mutual information values for precipitation indices: April and May precipitation prove to be the most informative, while June, July, and August are characterized by lower values. Taken together, these results indicate that, for both temperature and moisture, the dominant influence on summer pest infestation index variations occurs during the preceding spring period, which justifies the inclusion of these features in the forecast models.

To analyze the contribution of individual temperature predictors to the summer pest infestation index forecast, a post hoc analysis of the XGBoost model was performed using the SHAP (SHapley Additive exPlanations) method. This approach allows us to move from a “black box” to a quantitative assessment of how specific temperature values in different months of the growing season shift the forecast toward an increase or decrease in infestation.

Figure 7 shows how individual observations for t_april–t_august (color scale from low to high) are distributed along the SHAP axis, which reflects the feature’s contribution to the model output. Points shifted to the right correspond to increases in the predicted infestation index, while points shifted to the left correspond to decreases relative to the baseline. The widest “cloud” of points is observed for t_april, indicating the dominant influence of April temperature on forecast variability: high April temperatures (crimson) are predominantly associated with negative SHAP values, meaning they reduce expected infestation, while cooler years (blue) are more often associated with positive prediction bias. For t_may and t_july, the contribution remains noticeable, but smaller in amplitude. For t_june and t_august, the spread of SHAP values is close to zero, indicating a relatively weak effect of these months in the temperature-oriented XGBoost module compared to April conditions.
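The quantity plotted on the SHAP axis is a Shapley value: the average marginal contribution of a feature over all orderings of the features. For a handful of features it can be computed by brute force, as in the sketch below. This is an illustration only; practical SHAP implementations for tree ensembles use much faster algorithms, and the toy model, its coefficients, and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values; absent features are replaced by their baseline value."""
    n = len(x)

    def value(subset):
        return predict([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(total)
    return phi

# Hypothetical toy model: a warmer April lowers the index, a wetter April raises it.
def toy_model(features):
    t_april, p_april = features
    return 5.0 - 0.3 * t_april + 1.0 * p_april

phi = shapley_values(toy_model, x=[18.0, 0.4], baseline=[15.0, 0.2])
```

The contributions sum to the difference between the prediction for the observation and the prediction for the baseline, which is why points to the right of zero raise the forecast and points to the left lower it.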

To assess the contribution of monthly precipitation indices to the summer pest infestation forecast, the XGBoost model was likewise interpreted using the SHAP methodology. This analysis allows us to track which combinations of moisture conditions in April–August cause the model to adjust the expected infestation level upward or downward relative to the average scenario. Figure 8 shows the distribution of SHAP values for the p_april–p_august predictors, which characterize precipitation in individual months of the warm period. Each point represents one observation: the position on the horizontal axis shows the contribution of a specific predictor value to the XGBoost model output (SHAP value), and the color scale reflects the magnitude of the initial precipitation index (from drier conditions to abnormally wet periods).

The widest spread of SHAP values is observed for p_april and p_may, meaning that spring precipitation most often leads to a significant adjustment in the predicted summer infestation level, while precipitation values in June–August generally produce small absolute shifts around zero. This diagram indicates that the humidity regime at the beginning of the growing season plays a more sensitive role in modeling infestation risk, while the overall effect of precipitation remains less pronounced than that of temperature factors and acts more as a modifying factor than a dominant driver of the model.

4.2. Evaluation of the Model’s Forecasting Quality

The evaluation was conducted on a held-out temporal sample formed from observation years, allowing us to test the models’ ability to predict years not used in training. Standard regression metrics were calculated for all approaches: mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination ($R^{2}$). The lower the MSE/RMSE/MAE values and the higher the $R^{2}$, the more accurately the corresponding model approximates the actual infestation index. The comparative analysis includes five “flat” baseline models (linear and ridge regression, random forest, XGBoost, and a fully connected neural network (MLP)) and five specialized temporal architectures using the wavelet transform of the annual series and/or hybridization with ARIMA/SARIMA, recurrent units (LSTM, GRU), and the proposed spatiotemporal transformer. This allows us to evaluate how much progressively increasing architectural complexity and explicit modeling of the time structure of the series improve forecast accuracy compared to traditional machine learning methods.
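The four metrics follow directly from their definitions. The sketch below is a minimal reference implementation with toy numbers; the function name is hypothetical and the values are not results from the paper.

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and the coefficient of determination for one forecast."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - (mse * n) / ss_tot                        # 1 - SS_res / SS_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}

metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

Because MSE and RMSE square the errors, they weight rare extreme outbreaks more heavily than MAE does, which is one reason to report them together for a right-skewed target.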

To complement the importance analysis for each model and ensure a unified interpretable framework, SHAP (SHapley Additive ExPlanations) values were calculated for all tabular models. Unlike raw coefficient values or impurity-based estimates, SHAP values quantify the marginal contribution of each predictor to the model’s output in a consistent and theoretically sound manner. This allows for direct comparison of feature influence across models, independent of internal algorithmic mechanisms. Figure 9 presents a normalized SHAP importance matrix for different models, where each cell reflects the mean absolute SHAP value for a given feature for a specific model. Normalization ensures comparability across models with different output scales and allows for the identification of stable, consistently influential predictors.
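One way to build such a matrix is to average the absolute SHAP values per feature and rescale each model's row. The sketch below assumes normalization to a unit sum per model, which is only one plausible choice since the exact scheme is not stated; the model labels and SHAP values are synthetic.

```python
def normalized_importance(shap_by_model):
    """Mean |SHAP| per feature, rescaled so each model's importances sum to 1."""
    matrix = {}
    for model, rows in shap_by_model.items():
        n_features = len(rows[0])
        mean_abs = [sum(abs(r[j]) for r in rows) / len(rows) for j in range(n_features)]
        total = sum(mean_abs) or 1.0   # guard against an all-zero row
        matrix[model] = [m / total for m in mean_abs]
    return matrix

# Synthetic SHAP values (rows = observations, columns = features) on very different scales.
shap_by_model = {
    "model_a": [[0.2, -0.1], [-0.4, 0.1]],
    "model_b": [[30.0, 10.0], [-30.0, -10.0]],
}
importance = normalized_importance(shap_by_model)
```

In this toy case both models assign the same relative importance to the two features even though their raw SHAP magnitudes differ by two orders of magnitude.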

Figure 10 shows the SHAP-based feature importance for the proposed multi-branch hybrid model, focusing on the ten most influential predictors. Values represent normalized mean absolute SHAP contributions, allowing for a direct assessment of the influence of each variable on the model output. The results indicate a clear dominance of recent temporal components. The largest contribution is observed for lag_1_raw, followed by lag_1_approx, confirming that the most recent raw and approximated signals provide the greatest predictive power. Variables corresponding to the second lag (lag_2_raw and lag_2_approx) also demonstrate a significant influence, although their contribution gradually decreases with temporal distance. Features associated with the third lag demonstrate noticeably lower importance, while the detail components (lag_*_detail) contribute only marginally to the final prediction. This pattern suggests that the hybrid architecture effectively prioritizes short-term temporal dynamics, while longer lags and high-frequency detail components play a supporting but secondary role. Overall, the SHAP analysis confirms that the proposed multi-branch hybrid model captures a hierarchical temporal structure in which recent smoothed signals are the primary factors determining forecast accuracy.

Table 11 shows how the quality metrics differ across the entire range of tested models. The baseline models (LinearRegression, Ridge, RandomForest, XGBoost, MLP), which work directly with agroclimatic features, provide a moderate explanation of the infestation index variation ($R^{2} \approx 0.53$–$0.68$), with significantly higher errors. Incorporating a wavelet decomposition of the annual series and residual learning (Wavelet_CNN_ARIMA, Wavelet_CNN_SARIMA models) significantly reduces RMSE and MAE, reflecting the benefit of explicitly modeling the temporal structure. Even greater gains are achieved with recurrent heads (Wavelet-CNN-LSTM, Wavelet-CNN-GRU). The highest accuracy is demonstrated by the proposed multi-branch hybrid model Wavelet_CNN_SARIMA_GRU_STTrans, which records the lowest errors (RMSE = 0.0223; MAE = 0.0178) and the highest coefficient of determination, $R^{2} = 0.9466$, indicating an almost complete reproduction of the dynamics of the summer infestation index in the validation sample.

Further analysis revealed that the differences in forecast performance are consistent with how each model exploits the structure of the input data and its temporal smoothness. A brief discussion of the performance of all approaches is provided below. Linear regression and ridge regression with α = 0.1 yield virtually identical values for MSE, RMSE, MAE, and

R^{2}

. This indicates that the feature matrix is well-conditioned and the level of multicollinearity between agroclimatic parameters is low: mild L2 regularization barely changes the OLS solution. However, the limitations of the linear approximation prevent these models from fully capturing the nonlinear response of pest infestations to temperature and precipitation extremes, which is reflected in a moderate

R^{2} \approx 0.68

.

Random forests and XGBoost, despite their more flexible, tree-based structure, are inferior to linear models in terms of accuracy (especially XGBoost, which has the worst

R^{2}

). This can be explained by a combination of two factors: first, the relatively small number of training years, on which high-capacity ensemble models begin to overfit and fit noise; second, the high aggregation of features (annual means and sums), which means that complex nonlinear partitions of the feature space do not provide significant additional gains compared to a simple linear relationship. As a result, trees capture local fluctuations but do not enhance the model’s predictive ability in later years. The fully connected MLP neural network achieves RMSE and

R^{2}

values close to those of ensembles, while its mean absolute error is slightly higher. With a relatively small number of annual observations, MLP, like other highly parametric architectures without explicit consideration of temporal structure, is sensitive to the choice of initial weights and to random noise in the target variable. The lack of built-in mechanisms for modeling sequences (recurrent or time convolutional) means the neural network solves a static regression problem, limiting its potential.

The Wavelet-CNN model with an ARIMA base (Wavelet_CNN_ARIMA) demonstrates a significant performance improvement: a sharp decrease in RMSE and an increase in

R^{2}

to 0.81. This result is logical, since the two-stage “ARIMA + residual CNN” scheme decomposes the problem into a forecast of the smooth component of the time series (base ARIMA) and a forecast of high-frequency deviations (wavelet convolutions). ARIMA describes the inertial trend and weak autocorrelation of the annual index well, and the convolutional block compensates for local anomalies associated with individual extreme seasons.

Replacing ARIMA with SARIMA in the Wavelet_CNN_SARIMA model results in a slight performance degradation. Given the relatively short annual series, introducing a seasonal component with a 3-year period increases the number of parameters and complicates model estimation. With a limited number of observations, this can lead to overfitting and less stable seasonality estimates. As a result, the residuals fed to the CNN branch become noisier, making it more difficult for the convolutional part to recover the useful signal.
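The added complexity can be made concrete by expanding the multiplicative SARIMA (1, 0, 0) $\times {(1, 0, 0)}_{s = 3}$ specification with the backshift operator $B$. The symbols $\varphi_{1}$ (non-seasonal AR coefficient), $\Phi_{1}$ (seasonal AR coefficient), $\mu$ (series mean), and $\varepsilon_{t}$ (white noise) are standard notation assumed here for illustration:

```latex
(1 - \varphi_{1} B)(1 - \Phi_{1} B^{3})(y_{t} - \mu) = \varepsilon_{t}
\quad\Longrightarrow\quad
y_{t} - \mu = \varphi_{1}(y_{t-1} - \mu) + \Phi_{1}(y_{t-3} - \mu)
            - \varphi_{1}\Phi_{1}(y_{t-4} - \mu) + \varepsilon_{t}
```

Compared with the ARIMA (1, 0, 0) base, the model gains one extra coefficient and depends on values up to four years back, so in a conditional formulation a series of $T = 18$ points leaves at most 14 observations with a complete set of lags.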

Adding recurrent layers to the Wavelet-CNN-LSTM and Wavelet-CNN-GRU architectures partially improves the results compared to purely residual models. LSTM and GRU accumulate information from several previous years, thereby helping the CNN unit align high-frequency fluctuations with longer-term trends. The LSTM head yields a slightly higher

R^{2}

. The GRU variant, in turn, achieves a somewhat lower MAE, consistent with the known properties of these units: LSTM has better long-term memory retention, whereas GRU is more compact and less prone to overfitting on small samples.

The best performance is achieved by the proposed multi-branch hybrid model Wavelet_CNN_SARIMA_GRU_STTrans. Here, the base SARIMA reproduces the regular component of the series, the first CNN branch refines local wavelet characteristics, the second CNN-GRU branch captures medium-term dynamics, and the spatiotemporal transformer focuses on complex long-term dependencies and interactions across scales. Joint training of these three representations and their subsequent fusion in a shared fully connected block enables multi-branch signal processing and effective noise suppression. This multi-level decomposition leads to minimal error values (RMSE and MAE are an order of magnitude lower than those of other models) and a very high

R^{2} \approx 0.95

, indicating near-optimal use of information about the temporal structure of the infestation index.

Table 12 presents the results of the ablation study, quantifying the contribution of key architectural modules to forecasting performance. The full Wavelet_CNN_SARIMA_GRU_STTransformer model demonstrates the best results (R² = 0.8866; RMSE = 0.0325) and serves as the benchmark configuration. A slight performance degradation after removing SARIMA indicates that wavelet coding combined with GRU is already able to capture significant nonlinear dynamics, while the model with SARIMA alone shows significantly worse results, confirming the limitations of purely linear modeling. Removing the ST-Transformer leads to a noticeable decrease in accuracy, highlighting the importance of the attention mechanism for integrating context dependence. The most significant reduction occurs when removing the wavelet decomposition, demonstrating the critical role of temporal and frequency feature extraction. Overall, the results confirm that stable prediction performance is achieved through the joint interaction of wavelet preprocessing, sequential modeling, and attention-based integration.

Figure 11 shows the average summer infestation index of grain crops by the pest Phyllotreta vittula for the training years (blue dots) and validation years (green dots), together with the corresponding predictions of the linear regression model for 2019–2022 (orange dots).

The model generally reproduces the level and interannual variability of the population index, providing close agreement with observations in the validation interval and only moderately smoothing the extreme values of the series.

As shown in Figure 12, the Random Forest model satisfactorily describes the dynamics of the population index during the training interval. However, during the validation period (2019–2022), its predictive ability is limited: the predicted values systematically overestimate the observed index and poorly reflect interannual variability, smoothing out local minima and maxima.

This indicates that, with the given configuration of agroclimatic predictors, the model is not adequate for accurately predicting summer population density.

As shown in Figure 13, the XGBoost model fits the training set satisfactorily. Over the 2019–2022 validation interval, however, its predictive ability is limited: the predicted values systematically overestimate the observed population index and poorly reflect the actual interannual variability.

The XGBoost model significantly smooths out fluctuations in the series, fails to reproduce declining dynamics in individual years, and exhibits a bias toward higher risk levels than suggested by monitoring observations.

Taken together, the results for the two tree-based models, Random Forest and XGBoost, show that decision tree ensemble algorithms in this setting do not provide sufficiently accurate forecasts of summer infestation of grain crops by the pest Phyllotreta vittula. This is likely due to a combination of the relatively small volume and high noise of the initial data, a weak and nonlinear relationship between agroclimatic predictors and the infestation index, and the tendency of tree-based models to overfit limited samples with complex feature structures. As a result, they approximate training years well but produce biased, less robust estimates in independent validation years, which makes their use as the primary tool for operational forecasting risky without additional structural simplification or regularization.

As shown in Figure 14, the baseline multilayer perceptron model satisfactorily approximates the summer population index levels during the training interval without significant overfitting and retains the general shape of the interannual dynamics.

Over the 2019–2022 validation interval, the predicted values are close to the observed values, but a systematic tendency to underestimate the index relative to actual monitoring estimates is noticeable across all years. The amplitude of interannual fluctuations in the forecast is also smoothed: the model underestimates peaks and slightly overestimates minimum values, reflecting the MLP’s tendency to smooth given the limited sample size and the high noise of the agroclimatic predictors.

Overall, the results in Figure 14 indicate that the basic multilayer perceptron can extract robust nonlinear relationships between a set of temperature and precipitation parameters and the population index. Still, with the current architecture and data volume, the model provides only moderately accurate forecasts. The presence of a systematic downward bias and the smoothing of interannual variability indicate the need for additional tuning (regularization, changing the network depth, revising the feature set) or a transition to more specialized architectures better suited to small, noisy agroclimatic time series.

Figure 15 shows the dynamics of the average annual index of summer infestation of grain crops by the pest Phyllotreta vittula in the years of training the Wavelet_CNN_SARIMA model (blue dots) and validation (green dots), as well as the corresponding predicted values for the period 2019–2022 (orange dots). The hybrid wavelet–convolution–SARIMA model satisfactorily reproduces both the level and the shape of the index’s interannual variability in the training interval, without significant overfitting, and preserves sharp increases and subsequent decreases in infestation. In the validation interval, the predicted values practically coincide with the observed estimates, demonstrating minor systematic deviations and correctly reflecting the weak downward trend and the decrease in the amplitude of interannual fluctuations.

This behavior indicates the successful combination of the advantages of wavelet decomposition, which allows the extraction of informative frequency components of the index; the parametric SARIMA component, which describes the temporal dependence and residual seasonality; and the convolutional block, which extracts nonlinear relationships between agroclimatic predictors and the transformed population series. Taken together, the results presented in Figure 15 demonstrate the high predictive power of Wavelet_CNN_SARIMA and its superiority in describing the annual population index dynamics compared to baseline linear, tree-based, and simple neural network models.
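The multiscale decomposition step referred to above can be illustrated with a minimal sketch. It assumes a single-level Haar wavelet, which is an illustrative choice and not necessarily the wavelet family or decomposition depth used in this study; the function names and the example series are hypothetical.

```python
import math

def haar_dwt(series):
    """Single-level Haar DWT: returns (approximation, detail) coefficients.

    The approximation holds the smoothed low-frequency component and the
    detail holds the local high-frequency deviations.
    """
    if len(series) % 2 != 0:
        raise ValueError("series length must be even")
    s = math.sqrt(2.0)
    approx = [(series[i] + series[i + 1]) / s for i in range(0, len(series), 2)]
    detail = [(series[i] - series[i + 1]) / s for i in range(0, len(series), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the original series exactly."""
    s = math.sqrt(2.0)
    restored = []
    for a, d in zip(approx, detail):
        restored.append((a + d) / s)
        restored.append((a - d) / s)
    return restored

# Hypothetical annual infestation index values (not data from the study).
index = [0.12, 0.18, 0.25, 0.21, 0.30, 0.16, 0.14, 0.10]
approx, detail = haar_dwt(index)
restored = haar_idwt(approx, detail)
```

In a hybrid pipeline of the kind described here, the approximation and detail coefficients would be passed on as separate input channels, so that the downstream blocks see the slow trend and the fast deviations separately.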

As shown in Figure 16, the hybrid Wavelet_CNN_ARIMA model accurately reproduces the interannual dynamics of the average annual summer population index in both the training interval and the 2019–2022 independent validation period. The predicted values (orange line) are virtually identical to the observed index estimates (green line), with only a slight systematic tendency to underestimate population levels in the final years of the series. Moreover, the model accurately captures the index’s smoothly declining trajectory and the amplitude of its interannual fluctuations, without over-smoothing or creating artificial extremes.

The results indicate that the combination of wavelet decomposition, a convolutional block, and an ARIMA component enables effective extraction of local temporal patterns in the population series, with a parametric description of the linear time dependence. Taken together, this provides higher predictive accuracy than basic linear, tree-based, and simple neural network models, particularly in reproducing the structure of interannual index variability with a limited time series length.

As shown in Figure 17, the Wavelet_CNN_LSTM hybrid model reproduces the interannual dynamics of the average annual index with fair accuracy.

The predicted values (orange line) lie close to the observed index estimates (green line), showing only moderate smoothing of the amplitude of interannual fluctuations and a slight downward shift in the final years of the series. Moreover, the model accurately captures the weak downward trend and relative stability of the index during the validation period, without producing population peaks that are clearly over- or underestimated. This forecast quality demonstrates that the combination of wavelet decomposition, convolutional local pattern extraction, and the LSTM recurrent unit effectively accounts for both short-term and longer-term time dependencies in the population index dynamics with a limited time series length. Overall, the results for Wavelet_CNN_LSTM confirm the high predictive power of this architecture and its superior stability and accuracy compared to baseline linear, tree-based, and simple MLP models. However, the degree of smoothing of extreme values must be taken into account when the forecasts are used to assess risk in plant protection systems.

As Figure 18 shows, the Wavelet_CNN_GRU hybrid model satisfactorily reproduces the interannual dynamics of the average annual summer infestation index of grain crops by the pest Phyllotreta vittula both in the training interval and during the 2019–2022 independent validation period. During validation, the predicted values (orange line) are close to the observed estimates (green line), slightly overestimating the index levels in the first years of the interval and smoothly converging to the actual values toward the end of the series. The model correctly reproduces the weak downward trend and relatively small amplitude of interannual fluctuations, without creating artificial spikes in infestation or excessively smoothing the dynamics.

The results demonstrate that the combination of wavelet decomposition, a convolutional block, and a GRU recurrent unit effectively extracts informative temporal patterns from short, noisy infestation index data. The wavelet transform highlights significant frequency components, the CNN fragment generalizes local structures, and the GRU units capture both short- and long-term dependencies. Together, this ensures high predictive accuracy and robust estimates, surpassing basic linear and tree-based models while maintaining interpretable risk dynamics consistent with biological concepts.

As shown in Figure 19, the proposed Wavelet-SARIMA-GRU Spatio-Temporal Hybrid model almost completely reproduces the dynamics of the average annual summer infestation index over the entire observation interval.

During the training period, the model adequately captures both the index levels and the shape of interannual fluctuations, showing neither obvious overfitting nor excessive smoothing of extremes. Over the 2019–2022 validation interval, the predicted values (orange line) are virtually identical to the observed index estimates (green line): there is no consistent upward or downward bias, and the point errors are small and comparable to the uncertainty of the field observations themselves. According to the summary quality assessment results (metrics table), the Wavelet-SARIMA-GRU hybrid model exhibits the lowest error values (MSE, RMSE, MAE) and the highest coefficient of determination (R²) among all the options considered, including linear regression, decision tree ensembles, the baseline MLP, and other wavelet-hybrid architectures. This advantage is explained by the combination of three complementary components: a wavelet decomposition, which identifies informative multiscale components of the index; a SARIMA block, which describes the linear trend-seasonal structure of the time series; and a recurrent GRU module, which captures residual nonlinear and long-term dependencies due to variations in agroclimatic factors and spatial heterogeneity. As a result, the model provides the best compromise between forecast accuracy and robustness of estimates, making it the most promising tool for operational forecasting of the seasonal risk of crop infestation by the Phyllotreta vittula pest in conditions with limited historical data.
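The error measures used in this comparison (MSE, RMSE, MAE, and R²) follow their standard definitions. The sketch below shows one way to compute them; the function name and the four example values are hypothetical and are not results from the study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R² for paired observations and forecasts."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variance
    ss_res = sum(e * e for e in errors)                 # unexplained variance
    # R² is undefined when the observations are constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}

# Hypothetical observed and predicted index values for four validation years.
observed = [0.20, 0.18, 0.17, 0.15]
predicted = [0.21, 0.18, 0.16, 0.15]
metrics = regression_metrics(observed, predicted)
```

With only four validation years, R² is sensitive to a single large error, which is one reason to report it alongside RMSE and MAE rather than on its own.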

The temporal evaluation strategy used in this study strictly preserved the chronological structure of the data. Future years were never included in the training set, and each test year was evaluated only with models trained on previous observations, without any temporal shuffling. For tabular machine learning models (Ridge, Random Forest, and XGBoost), this procedure was implemented as a rolling 8-fold cross-validation in which the training window was successively expanded and each fold corresponded to a later, held-out test year. For hybrid and neural network models, evaluation was based on rolling time sequences constructed solely from past observations, followed by a fixed hold-out validation on subsequent years. Consequently, although chronological logic was consistently preserved across all experiments and no temporal leakage occurred, the validation scheme was not identical for all model classes. Cross-validation was used for hyperparameter selection and stability assessment for tabular models, while hybrid and deep architectures were evaluated using training sequences containing only historical data, combined with a fixed chronological hold-out test. This clarification resolves an apparent discrepancy between the previously described cross-validation procedure and the fixed hold-out evaluation used in the final comparative experiments.
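The expanding-window scheme described above can be sketched as follows. This is one possible reading of the procedure rather than the authors' implementation: the function name is hypothetical, and it is assumed for illustration that the folds are drawn from the years 2005–2018, with the last eight of those years each serving once as the test year.

```python
def expanding_window_folds(years, n_folds):
    """Yield (train_years, test_year) pairs in chronological order.

    Each fold trains on all years strictly before its test year, so the
    training window grows by one year per fold and no future data leaks in.
    """
    ordered = sorted(set(years))
    if n_folds < 1 or n_folds >= len(ordered):
        raise ValueError("n_folds must be between 1 and len(years) - 1")
    first_test = len(ordered) - n_folds
    for k in range(first_test, len(ordered)):
        yield ordered[:k], ordered[k]

# Assumed year range for the cross-validated part of the data.
folds = list(expanding_window_folds(range(2005, 2019), n_folds=8))
for train_years, test_year in folds:
    print(f"train {train_years[0]}-{train_years[-1]} -> test {test_year}")
```

Because every training window ends before its test year, shuffling is never needed, and the same generator can feed any tabular model that accepts a list of training rows selected by year.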

5. Discussion

The study results demonstrate that the dynamics of summer pest infestations of grain crops are influenced by a combination of early-spring temperature conditions, interseasonal humidity anomalies, and spatial heterogeneity in phytosanitary pressure. The identified relationships are consistent with the biological mechanisms of pest development and are confirmed by the descriptive characteristics of the panel dataset (Table 6) and extended seasonal integral indicators (Table 7). Analysis of the distributions of the main variables (Figure 3) and the spatial ranking of regions (Figure 2) highlights pronounced heterogeneity and rare extreme outbreaks, making the forecasting task particularly complex and sensitive to the data structure.

The correlation analysis (Figure 4) showed the dominant influence of April and May temperatures on variations in the summer infestation index, and mutual information (Figure 6) confirmed that early spring climatic conditions contain the largest amount of prognostically significant information. These results are consistent with the biological mechanisms of pest emergence from diapause and early infestation development. Additional analysis of feature importance in the XGBoost model (Figure 5) showed a similar distribution of influence: early temperature and the early spring precipitation regime are critical predictors of summer infestation size.

The negative SHAP contribution of April temperature suggests that abnormally warm conditions in early spring may reduce the summer infestation index. This effect can be explained by the biological characteristics of Phyllotreta vittula. The species overwinters as adults in soil or plant debris. Elevated April temperatures may prematurely activate overwintering individuals, increasing their metabolism before sufficient food resources become available. Such early activation can lead to energy depletion and reduced survival. Furthermore, sharp temperature fluctuations in early spring may disrupt the synchrony between insect emergence and host plant phenology. As a result, early warming may paradoxically reduce population density later in the season due to increased mortality and reproductive stress.

Model interpretation using SHAP (Figure 7 and Figure 8) further demonstrates how actual temperature and precipitation values shift the predicted infestation index upward or downward. The widest ranges of SHAP contributions are observed for t_april, p_april, and p_may, confirming the dominance of spring conditions in the model’s regression structure. These results are consistent with the findings of the correlation analysis and with the distributions of seasonal aggregates (Table 11), highlighting the internal consistency of the different interpretation methods.

In addition to analyzing the Pearson linear correlation, we compared linear dependencies with the SHAP feature importance derived from nonlinear machine learning models. While the Pearson correlation captures only linear relationships between predictors and the target variable, SHAP reveals nonlinear contributions and interaction effects as captured by ensemble and deep learning architectures. The comparison shows that several climate variables with moderate or weak linear correlations exhibit high SHAP importance in the hybrid model, indicating the presence of nonlinear threshold and interaction dynamics that are not detectable using traditional linear analysis. These results demonstrate that pest outbreak behavior is determined not only by linear dependencies but also by complex nonlinear interactions, and that SHAP-based interpretation provides additional and more comprehensive insights.
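A small synthetic example illustrates why a predictor with a weak linear correlation can still be highly informative. The sketch does not compute SHAP values; it only shows that a symmetric, threshold-like response yields a Pearson coefficient of zero even though the target is fully determined by the predictor. The variable names and numbers are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical temperature anomalies and a response that rises for both
# unusually cold and unusually warm conditions (a purely nonlinear effect).
anomaly = [-3, -2, -1, 0, 1, 2, 3]
response = [a * a for a in anomaly]

r_raw = pearson(anomaly, response)                    # linear view of the raw predictor
r_abs = pearson([abs(a) for a in anomaly], response)  # after a nonlinear transform
```

Here `r_raw` vanishes by symmetry, while the transformed predictor correlates strongly with the response, which is the kind of dependence that a nonlinear model and a SHAP-style attribution can pick up but a correlation table cannot.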

A comparative analysis of model accuracy (Table 5) demonstrates that traditional machine learning methods, linear regression, random forest, and XGBoost, are unable to reproduce temporal and interannual data dependencies effectively. Linear models oversimplify predictor interactions, ensembles are prone to overfitting on a limited time series, and MLPs do not exploit the sequential structure of the series. In contrast, hybrid temporal models, including Wavelet_CNN_ARIMA and Wavelet-CNN-LSTM/GRU, achieve significant improvements in accuracy by accounting for the multiscale structure of the temporal signal.

The most significant progress is achieved by the proposed multi-branch Wavelet–CNN–SARIMA–GRU–STTransformer architecture, which integrates the advantages of SARIMA models, convolutional networks, recurrent layers, and the Transformer attention mechanism (see Figure 1). The multi-level decomposition of the signal into underlying seasonality, local high-frequency deviations, medium-term dependencies, and long-range interactions is what ensures the lowest RMSE and MAE values (Table 11). Thus, the model demonstrates technological novelty, interpretability, and high robustness to noise.

Limitations and Future Work. Despite the high accuracy and robustness of the proposed Wavelet–CNN–SARIMA–GRU–STTransformer hybrid architecture, the study has several limitations, stemming from both the data structure and the specifics of the chosen modeling approach. First, the panel dataset used covers 18 years, ensuring reliable interannual dynamics. However, the time series remains relatively short for deep models, especially when analyzing rare extreme infestation spikes. This limits the possibility of more complex training of transformer blocks and may reduce the model’s generalization when climatic conditions deviate from historical ranges, which is especially important given increasing climate variability.

Second, the climate features include only monthly average temperature and humidity indices for April–August. This format reflects the general seasonal background but does not capture intra-month extremes, the duration of anomalies, the frequency of dry and wet episodes, or quasi-diurnal dynamics, which can have a significant impact on pest survival. Further inclusion of daily or ten-day (dekadal) data, as well as extreme climate indices (e.g., the number of days with temperatures above threshold values), could improve the model’s ability to recognize rare but significant bioclimatic events.
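As a minimal illustration of the extreme climate indices mentioned above, the following sketch derives two such features from a daily temperature series: the number of days above a threshold and the longest consecutive spell above it. The function names, the 15 °C threshold, and the sample values are hypothetical.

```python
def days_above(daily_temps, threshold):
    """Count the days whose temperature exceeds the threshold."""
    return sum(1 for t in daily_temps if t > threshold)

def longest_run_above(daily_temps, threshold):
    """Length of the longest consecutive spell above the threshold."""
    longest = current = 0
    for t in daily_temps:
        current = current + 1 if t > threshold else 0
        longest = max(longest, current)
    return longest

# Hypothetical daily mean temperatures (°C) for ten April days.
april = [8.0, 12.5, 16.0, 17.2, 9.1, 15.5, 18.3, 19.0, 14.9, 7.4]
hot_days = days_above(april, 15.0)          # 5
hot_spell = longest_run_above(april, 15.0)  # 3
```

Two months with the same mean temperature can differ sharply in both indices, which is the information lost when only monthly averages are used.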

Third, the architecture uses district-level data aggregation (ID), enabling comparisons across territories. However, the spatial component is only indirectly captured through stationary identifiers. The model does not utilize a fully functional geospatial data structure, such as distances between regions, coordinates of agro-unit centers, spatial patterns of pest distribution, or migration effects. The integration of geospatial layers or graph neural networks could significantly enhance the model’s ability to describe spatiotemporal interactions.

Fourth, the current study lacks data on specific agricultural practices (crop rotation, sowing time, insecticide application, and varietal characteristics). These factors often have a significant impact on pest abundance and may explain some of the variation currently interpreted as climate-related. The addition of management and biological predictors, along with remote sensing data (NDVI, NMDI, LST), would enable the development of a more comprehensive and practical model for operational monitoring.

Future research areas include expanding the architecture by integrating multichannel remote sensing data, implementing graph-transformer models to account for spatial relationships, using ensemble MTL (multi-task learning) architectures for simultaneously predicting infestation and abundance, and constructing probabilistic models that return a forecast uncertainty interval, an essential characteristic for early warning systems. A promising direction is the creation of an online platform that updates the model in real time as new data becomes available, enabling dynamic pest risk management in precision farming systems.

The target variable exhibits pronounced right skewness, reflecting rare but severe pest outbreaks. This poses challenges for regression-based forecasting models, as standard loss functions (e.g., MSE, RMSE) primarily optimize for average performance and may underestimate extreme cases. Although the proposed hybrid model achieves high overall accuracy, forecasting uncertainty increases for such high-impact events due to the limited number of historical observations and high variance. Further improvements could include quantile regression to model the upper tail of the distribution, extreme value theory approaches for rare regimes, cost-sensitive loss functions to emphasize extreme cases, hybrid classification-regression strategies for outbreak detection, and uncertainty quantification to provide probabilistic forecasts. These avenues could improve the robustness and reliability of early warning systems under extreme agroclimatic conditions.
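The quantile regression option mentioned above relies on the pinball (quantile) loss, which penalizes under- and over-prediction asymmetrically. The sketch below is a generic illustration of that loss and not a component of the model evaluated in this study; the function name and the numeric example are hypothetical.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss for quantile level tau (0 < tau < 1).

    Under-prediction (y_true > y_pred) is weighted by tau and
    over-prediction by (1 - tau), so a high tau targets the upper tail.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)

# With tau = 0.9, missing an outbreak by 0.1 costs nine times more than
# overshooting by the same amount.
under = pinball_loss([1.0], [0.9], tau=0.9)
over = pinball_loss([1.0], [1.1], tau=0.9)
```

Training with a high quantile level therefore pushes the forecast toward the upper tail of the infestation distribution, which may be preferable to a mean-oriented loss when missed outbreaks are the costlier error.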

6. Conclusions

In this paper, a new multi-branch hybrid architecture, Wavelet–CNN–SARIMA–GRU–STTransformer, is developed and validated for forecasting the interannual dynamics of summer pest infestations of grain crops under conditions of agroclimatic variability. The model combines the advantages of a seasonal linear basis, multiscale wavelet decomposition, recurrent modeling of short- and medium-term dependencies, and transformer attention, ensuring high forecast accuracy and robustness to extreme values of the target variable. A systematic comparison of 10 models was conducted using a harmonized panel dataset (2005–2022), demonstrating that the proposed architecture achieves the lowest RMSE and MAE and the highest coefficient of determination, significantly outperforming traditional algorithms.

The results demonstrate the great potential of hybrid neural network models for developing intelligent systems to monitor and manage phytosanitary risks. High accuracy, reproducibility, and biological interpretability make the developed approach suitable for integration into digital precision farming platforms and for creating more reliable pest early warning systems. Furthermore, the architecture remains extensible: it can be adapted to predict other bioclimatic parameters, combined with remote sensing data, and used as a component of comprehensive agro-risk analysis systems. Thus, the proposed model makes a significant contribution to the development of technological solutions for sustainable agriculture and can serve as a basis for further research in the field of intelligent analysis of temporal agroclimatic data.

Author Contributions

Conceptualization, G.A., A.A., N.O., S.S. and N.T.; methodology, G.A., A.A., N.O., S.S. and N.T.; software, S.S. and N.T.; validation, G.A., A.A., N.O. and S.S.; formal analysis, G.A., A.A., G.M. and S.S.; investigation, G.A., A.A., N.O., S.S. and G.M.; resources, G.A., A.A., N.O., S.S., N.T. and G.M.; data curation, G.M.; writing—original draft preparation, G.A., A.A. and G.M.; writing—review and editing, A.A., N.O., N.T. and G.M.; visualization, S.S. and G.M.; supervision, G.A. and A.A.; project administration, G.A. and A.A.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Higher Education of the Republic of Kazakhstan, IRN “AP 19675312 Analytical system for forecasting the dynamics of the number of pests of grain crops in Kazakhstan based on a neural network model”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RMSE	Root Mean Square Error
MAE	Mean Absolute Error
MAPE	Mean Absolute Percentage Error
ARIMA	AutoRegressive Integrated Moving Average
Conv1D	One-dimensional Convolution Layer
GHI	Global Horizontal Irradiance
ETR	ExtraTerrestrial Radiation
DOY	Day Of Year
Grad-CAM	Gradient-weighted Class Activation Mapping
MSE	Mean Squared Error
LSTM	Long Short-Term Memory

Figure 1. Schematic diagram of the multi-branch hybrid model Wavelet_CNN_SARIMA_GRU_STTransformer for forecasting the annual pest infestation index.

Figure 2. Top 15 regions by average summer pest infestation index (zasel_l) for the period 2005–2022.

Figure 3. Diagram of the range of the summer pest infestation index (zasel_l).

Figure 4. Linear dependences of temperature and precipitation predictors (April–August) on the summer pest infestation index (zasel_l).

Figure 5. The importance of temperature and precipitation predictors in the XGBoost model in predicting the summer infestation index (zasel_l).

Figure 6. Mutual information between monthly temperature and precipitation predictors and the summer pest infestation index (zasel_l).

Figure 7. SHAP beeswarm plot of temperature predictors for the target variable (zasel_l).

Figure 8. SHAP-beeswarm diagram of the influence of monthly precipitation indicators on the forecast of the summer pest infestation index (zasel_l).

Figure 9. Comparative SHAP-based feature importance across Linear, Ridge, Random Forest, and MLP models.

Figure 10. SHAP-based feature importance of the proposed hybrid branch model (top 10 predictors).

Figure 11. Dynamics of the average index of summer infestation of grain crops by the pest Phyllotreta vittula (zasel_l) and the forecast of the linear regression model.

Figure 12. Forecast of the average summer infestation index of grain crops by the pest Phyllotreta vittula (zasel_l) using the Random Forest model.

Figure 13. Forecast of the average summer infestation index of grain crops by the pest Phyllotreta vittula (zasel_l) using the XGBoost model.

Figure 14. Forecast of the average summer infestation index of grain crops by the pest Phyllotreta vittula (zasel_l) using the multilayer perceptron baseline model (MLP baseline).

Figure 15. Forecast of the average summer infestation index of grain crops by the pest Phyllotreta vittula (zasel_l) using the Wavelet_CNN_SARIMA hybrid model.

Figure 16. Forecast of the average summer infestation index of grain crops by the pest Phyllotreta vittula (zasel_l) using the Wavelet_CNN_ARIMA hybrid model.

Figure 17. Forecast of the average summer infestation index of grain crops by the pest Phyllotreta vittula (zasel_l) using the Wavelet_CNN_LSTM hybrid model.

Figure 18. Forecast of the average summer infestation index of grain crops by the pest Phyllotreta vittula (zasel_l) using the Wavelet_CNN_GRU hybrid model.

Figure 19. Forecast of the average summer infestation index of grain crops by the pest Phyllotreta vittula (zasel_l) according to the proposed hybrid model Wavelet-SARIMA-GRU Spatio-Temporal Hybrid.

Table 1. Selected optimal hyperparameters.

Model	Selected Hyperparameters	Selection Criterion
Ridge	α = 0.1	Lowest CV RMSE
Random Forest	n_estimators = 300, min_samples_leaf = 2, max_features = ‘sqrt’	Best RMSE/low R² gap
XGBoost	learning_rate = 0.03, max_depth = 3, n_estimators = 300	Stable across folds

Table 2. Architectural specifications and hyperparameter configuration of all compared models.

Model	Type/Architecture	Key Hyperparameters/Structure	Input Features
LinearRegression	Linear regression	fit_intercept = True; feature standardization (StandardScaler)	Agroclimatic feature vector $x_{i}$ (year, region, temperature, precipitation, area)
Ridge ( $α = 0.1$ )	Linear regression with L2 regularization	$α = 0.1$ ; all other parameters identical to LinearRegression	Same feature vector $x_{i}$
RandomForest	Tree ensemble (RF Regressor)	n_estimators = 500; min_samples_split = 4; min_samples_leaf = 2; max_features = “sqrt”; bootstrap = True; random_state = 42	Standardized $x_{i}$
XGBoost	Gradient boosting over decision trees	n_estimators = 600; max_depth = 4; learning_rate = 0.05; subsample = 0.8; colsample_bytree = 0.8; random_state = 42	Standardized $x_{i}$
MLP	Fully connected neural network	Dense (128, ReLU) → Dropout (0.2) → Dense (64, ReLU) → Dropout (0.2) → Dense (32, ReLU) → Dense (1, linear); optimizer Adam (1 × 10⁻³)	Standardized $x_{i}$
Wavelet_CNN_ARIMA	Hybrid Wavelet–CNN + ARIMA (residual model)	ARIMA (1, 0, 0); residual window (q = 3); CNN: Conv1D (32) → MaxPool → Conv1D (16) → Flatten → Dense (32) → Dropout (0.2) → Dense (1); Adam (1 × 10⁻³)	Wavelet-transformed annual sequence $X_{t} \in \mathbb{R}^{3 \times 3}$
Wavelet_CNN_SARIMA	Wavelet–CNN + SARIMA (residual model)	SARIMA (1, 0, 0) × (1, 0, 0)₃; CNN frontend identical to Wavelet_CNN_ARIMA	$X_{t}$
Wavelet–CNN–LSTM	Wavelet–CNN with LSTM head	CNN frontend → LSTM (32) → Dropout (0.2) → Dense (1); Adam (1 × 10⁻³)	$X_{t}$
Wavelet–CNN–GRU	Wavelet–CNN with GRU head	CNN frontend → GRU (32) → Dropout (0.2) → Dense (1); Adam (1 × 10⁻³)	$X_{t}$
Wavelet_CNN_SARIMA_GRU_STTransformer	Multi-branch hybrid: SARIMA + CNN + CNN–GRU + ST-Transformer	SARIMA (1, 0, 0) × (1, 0, 0)₃; Branch 1 (CNN): Conv1D (32) → MaxPool → Conv1D (16) → GlobalAvgPool → Dense (32 → 16); Branch 2: CNN + GRU (32) → Dense (16); Branch 3 (ST-Transformer): d_model = 64, 8-head MHA, FFN, GlobalAvgPool; fusion of three branches → Dense (32 → 16 → 1)	Wavelet sequence $X_{t}$ + SARIMA residuals $y_{t}$

Table 3. Chronological cross-validation protocol.

Fold	Training Years	Test Year
1	2005–2014	2015
2	2005–2015	2016
3	2005–2016	2017
4	2005–2017	2018
5	2005–2018	2019
6	2005–2019	2020
7	2005–2020	2021
8	2005–2021	2022
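
The protocol in Table 3 is an expanding-window scheme: each fold trains on every year from 2005 up to the year before the test year. A minimal sketch of how such folds could be generated is given below; the function name `expanding_window_folds` is hypothetical and not taken from the paper.

```python
def expanding_window_folds(first_year, first_test_year, last_test_year):
    """Build (train_years, test_year) pairs with a growing training window."""
    folds = []
    for test_year in range(first_test_year, last_test_year + 1):
        # Train only on years strictly before the test year (no look-ahead).
        train_years = list(range(first_year, test_year))
        folds.append((train_years, test_year))
    return folds


# Year bounds taken from Table 3.
folds = expanding_window_folds(2005, 2015, 2022)
for i, (train, test) in enumerate(folds, start=1):
    print(f"fold {i}: train {train[0]}-{train[-1]}, test {test}")
```

Because the training window never includes the test year or any later year, each fold mimics a genuine one-year-ahead forecast.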

Table 4. Evaluation results on the training, validation, and test sets after fixing the model parameters.

Model	Split	MSE	RMSE	MAE	R²	n_years
Linear regression	training	0.0541	0.2325	0.1761	0.8257	14
Linear regression	validation	0.0006	0.0251	0.0249	0.8550	2
Linear regression	test	0.0070	0.0834	0.0825	−0.4570	2
Random Forest	training	0.0035	0.0592	0.0520	0.9887	14
Random Forest	validation	0.0506	0.2250	0.2225	−10.6618	2
Random Forest	test	0.1031	0.3211	0.3136	−20.5943	2
XGBoost	training	0.0045	0.0670	0.0536	0.9855	14
XGBoost	validation	0.0105	0.1026	0.0938	−1.4244	2
XGBoost	test	0.0777	0.2787	0.2374	−15.2717	2
MLP baseline	training	0.1104	0.3323	0.2338	0.6439	14
MLP baseline	validation	0.0094	0.0972	0.0902	−1.1758	2
MLP baseline	test	0.0013	0.0355	0.0302	0.7367	2
Wavelet_CNN_SARIMA	training	0.3331	0.5771	0.3879	0.1080	11
Wavelet_CNN_SARIMA	validation	0.0005	0.0229	0.0228	0.8790	2
Wavelet_CNN_SARIMA	test	0.0065	0.0805	0.0784	−0.3563	2
Wavelet_CNN_ARIMA	training	0.2804	0.5295	0.3744	0.2492	11
Wavelet_CNN_ARIMA	validation	0.0155	0.1246	0.1198	−2.5757	2
Wavelet_CNN_ARIMA	test	0.0520	0.2281	0.2226	−9.9001	2
Wavelet_CNN_LSTM	training	0.3875	0.6225	0.3636	−0.0377	11
Wavelet_CNN_LSTM	validation	0.0010	0.0315	0.0289	0.7712	2
Wavelet_CNN_LSTM	test	0.0069	0.0831	0.0765	−0.4463	2
Wavelet_CNN_GRU	training	0.4037	0.6354	0.3768	−0.0813	11
Wavelet_CNN_GRU	validation	0.0024	0.0486	0.0448	0.4560	2
Wavelet_CNN_GRU	test	0.0047	0.0684	0.0601	0.0200	2
Wavelet_CNN_SARIMA_GRU_STTransformer	training	0.3514	0.5928	0.3746	0.0590	11
Wavelet_CNN_SARIMA_GRU_STTransformer	validation	0.0229	0.1514	0.1414	−4.2808	2
Wavelet_CNN_SARIMA_GRU_STTransformer	test	0.0355	0.1885	0.1855	−6.4435	2
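
Several validation and test rows in Table 4 combine a small RMSE with a strongly negative R². With n_years = 2 this is expected whenever the two target values are close together, because the total sum of squares in the R² denominator becomes very small. The sketch below assumes the standard definitions of MSE, RMSE, MAE, and R²; the numbers in `y_true` and `y_pred` are hypothetical and are not taken from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 using the standard definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 is undefined when all targets are identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}


# Hypothetical two-year evaluation: nearly constant targets, modest errors.
y_true = [0.50, 0.52]
y_pred = [0.60, 0.60]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

In this illustration the RMSE is about 0.09, yet R² is about −81, since the two targets differ by only 0.02. R² values computed on two points are therefore best read together with RMSE and MAE rather than on their own.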

Table 5. Results of the Ljung–Box test for SARIMA residuals (autocorrelation test).

ID	Ljung–Box Statistic (lb_stat)	p-Value of the Ljung–Box Test (lb_pvalue)
1	0.37725	0.539078
2	3.194833	0.202419
3	3.585907	0.30979
4	3.795268	0.434422
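
The Ljung–Box statistic for lags 1 to h is Q = n(n + 2) Σ ρ_k² / (n − k), and its p-value is the upper tail of a chi-square distribution. The sketch below uses only the Python standard library. It assumes that the ID column of Table 5 is the lag h and that the degrees of freedom equal h with no correction for fitted SARIMA parameters; both are readings of the table, not statements from the paper. The helper names are hypothetical.

```python
import math


def ljung_box_stat(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)
    q = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k.
        rho_k = sum(centered[t] * centered[t + k] for t in range(n - k)) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q


def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df (closed form)."""
    half = x / 2.0
    if df % 2 == 0:
        series = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * series
    extra = sum(half ** (i - 0.5) / math.gamma(i + 0.5)
                for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * extra


# Statistics from Table 5, keyed by assumed lag.
table5 = {1: 0.37725, 2: 3.194833, 3: 3.585907, 4: 3.795268}
p_values = {lag: chi2_sf(stat, lag) for lag, stat in table5.items()}
print(p_values)
```

Under these assumptions the computed p-values agree with the lb_pvalue column of Table 5. All four exceed 0.05, so the test does not reject the hypothesis of uncorrelated residuals at these lags.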

Table 6. Fragment of the panel dataset (example rows for 2005).

ID	t_april	t_may	t_june	t_july	t_august	p_april	p_may	p_june	p_july	p_august	audan_l	kol_l_1	kol_l_2
1	18.7540	19.3220	27.2690	29.9400	21.9330	0.4050	0.0310	0.7980	0.7820	0.6360	2.2630	0.8560	15.1426
2	15.2020	19.4180	24.4140	28.4830	24.0620	0.8090	0.9420	0.9130	0.1850	0.3470	0.0320	0.0450	5.7250
3	18.9290	21.4850	23.1240	28.1080	22.4400	0.7830	0.4880	0.8850	0.6830	0.1170	1.5570	0.5700	27.1837
4	15.3540	21.4470	23.6890	27.7330	28.0880	0.2920	0.6660	0.8440	0.3590	0.0290	2.0980	0.7070	4.3176
5	15.1910	21.1780	24.6930	23.9090	22.0790	0.0710	0.2430	0.9390	0.5250	0.4450	2.3750	0.0190	12.4620

Table 7. Descriptive statistics of the main variables of the panel dataset (n = 4500 observations).

Variable	Mean	Std	Min	25%	50%	75%	Max
t_april	14.7987	4.194	−0.6667	14.3670	15.8660	17.4627	41.0
t_may	20.6637	3.672	1.18	19.1130	21.1680	23.0770	28.282
t_june	24.0999	2.541	11.19	22.0680	24.1919	26.1090	33.333
t_july	27.2262	2.846	13.28	25.0225	27.1480	29.5090	35.4520
t_august	25.2255	2.736	13.73	22.9429	25.1980	27.5233	31.94
p_april	0.4737	0.288	0.000	0.2067	0.4541	0.7310	0.9993
p_may	0.4714	0.284	0.000	0.2160	0.4480	0.7161	1.000
p_june	0.4824	0.286	0.001	0.2250	0.4590	0.7350	1.000
p_july	0.4675	0.285	0.000	0.2150	0.4415	0.7160	1.2685
p_august	0.4737	0.287	0.000	0.2132	0.4530	0.7200	1.000
audan_l	2.5801	6.326	0.001	0.5887	1.4615	2.3655	141.367
zasel_l	1.2890	4.718	0.000	0.2500	0.5000	0.7920	116.28
kol_l	23.3870	41.16	0.0905	2.4696	7.2178	22.1255	373.92

Table 8. Descriptive statistics by data split (train/validation/test).

Split	zasel_l Mean	zasel_l Std	zasel_l Median	zasel_l Min	zasel_l Max
Train	1.3213	5.0453	0.500	0.000	116.29
Validation	1.2449	3.7328	0.505	0.001	32.14
Test	1.1071	2.8685	0.500	0.003	32.14

Table 9. Climatic features (April–June).

Split	t_april Mean	t_april Std	t_may Mean	t_may Std	t_june Mean	t_june Std	p_april Mean	p_april Std	p_may Mean	p_may Std
Train	15.003	4.086	20.635	3.631	24.118	2.563	0.477	0.291	0.472	0.289
Validation	14.212	4.284	20.649	3.699	23.850	2.520	0.466	0.284	0.464	0.271
Test	13.957	4.677	20.882	3.928	24.226	2.395	0.456	0.274	0.478	0.267

Table 10. Derived seasonal temperature and precipitation indicators (descriptive statistics, n = 4500).

Variable	Mean	Std	Min	25%	50%	75%	Max
t_early_mean	17.731	3.348	1.0910	17.278	18.630	19.718	28.416
t_mid_mean	25.663	2.021	12.237	24.355	25.679	26.969	33.171
t_late_mean	25.225	2.736	13.734	22.942	25.198	27.523	31.944
t_season_mean	22.402	1.967	9.395	21.807	22.773	23.584	28.082
t_season_range	13.4684	4.2432	4.2850	10.5858	12.8175	15.243	34.833
p_early_sum	0.9451	0.4105	0.0050	0.6293	0.9490	1.2380	1.9850
p_mid_sum	0.9499	0.4158	0.0450	0.6160	0.9423	1.2450	1.9920
p_late_sum	0.4737	0.2879	0.0000	0.2132	0.4530	0.7200	1.0000
p_season_sum	2.3688	0.6794	0.4140	1.8828	2.3650	2.8363	4.4440
p_season_max	0.8074	0.1668	0.1700	0.7238	0.8550	0.9370	1.2685
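
The means in Table 10 are consistent with simple aggregates of the monthly columns in Table 7: for example, the mean of t_early_mean (17.731) equals the average of the t_april and t_may means, and the mean of p_season_sum (2.3688) is close to the sum of the five monthly precipitation means. The sketch below encodes that reading. The groupings (early = April–May, mid = June–July, late = August) and the range and max definitions are inferred from the summary statistics and variable names, not stated in this section, and `seasonal_features` is a hypothetical helper.

```python
MONTHS = ["april", "may", "june", "july", "august"]


def seasonal_features(row):
    """Derive seasonal indicators from monthly values (inferred definitions)."""
    t = [row[f"t_{m}"] for m in MONTHS]
    p = [row[f"p_{m}"] for m in MONTHS]
    return {
        "t_early_mean": (t[0] + t[1]) / 2,    # April-May
        "t_mid_mean": (t[2] + t[3]) / 2,      # June-July
        "t_late_mean": t[4],                  # August
        "t_season_mean": sum(t) / len(t),
        "t_season_range": max(t) - min(t),
        "p_early_sum": p[0] + p[1],
        "p_mid_sum": p[2] + p[3],
        "p_late_sum": p[4],
        "p_season_sum": sum(p),
        "p_season_max": max(p),
    }


# First example row of Table 6 (ID 1, year 2005).
row = {
    "t_april": 18.7540, "t_may": 19.3220, "t_june": 27.2690,
    "t_july": 29.9400, "t_august": 21.9330,
    "p_april": 0.4050, "p_may": 0.0310, "p_june": 0.7980,
    "p_july": 0.7820, "p_august": 0.6360,
}
features = seasonal_features(row)
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```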

Table 11. Comparison of the accuracy of basic and hybrid models for forecasting the summer infestation index zasel_l.

Model	MSE	RMSE	MAE	R²
LinearRegression	3.5172	1.8754	0.9144	0.6821
Ridge (α = 0.1)	3.5172	1.8754	0.9144	0.6821
RandomForest	4.3012	2.0739	0.9291	0.6112
XGBoost	5.1692	2.2736	0.8641	0.5328
MLP	4.1316	2.0326	1.0386	0.6266
Wavelet_CNN_ARIMA	0.0018	0.0422	0.0399	0.8089
Wavelet_CNN_SARIMA	0.0034	0.0579	0.0466	0.6393
Wavelet-CNN-LSTM	0.0026	0.0513	0.0438	0.7166
Wavelet-CNN-GRU	0.0028	0.0533	0.0397	0.6940
Wavelet_CNN_SARIMA_GRU_STTrans	0.0005	0.0223	0.0178	0.9466

Table 12. Results of ablation analysis of the components of the hybrid model.

Model	MSE	RMSE	MAE	R²	Delta_R2_vs_full
Wavelet_CNN_SARIMA_GRU_STTransformer	0.001054	0.032468	0.026706	0.886641	0
Wavelet_CNN_GRU	0.001105	0.033238	0.022619	0.881204	−0.005436
SARIMA_only	0.002241	0.047344	0.039759	0.758968	−0.127672
Wavelet_CNN_SARIMA_GRU (no ST)	0.002481	0.049809	0.03995	0.733214	−0.153427
NoWavelet_CNN_SARIMA_GRU_STTransformer	0.003945	0.06281	0.049426	0.575775	−0.310865
Wavelet_CNN_SARIMA	0.006735	0.082064	0.063037	0.275813	−0.610828
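
The Delta_R2_vs_full column in Table 12 appears to be the R² of each ablated variant minus the R² of the full model. The short check below recomputes it from the rounded R² values in the table; small differences in the last digit are expected from rounding. The variable names are hypothetical.

```python
# R^2 values copied from Table 12; the first entry is the full model.
r2_by_model = {
    "Wavelet_CNN_SARIMA_GRU_STTransformer": 0.886641,
    "Wavelet_CNN_GRU": 0.881204,
    "SARIMA_only": 0.758968,
    "Wavelet_CNN_SARIMA_GRU (no ST)": 0.733214,
    "NoWavelet_CNN_SARIMA_GRU_STTransformer": 0.575775,
    "Wavelet_CNN_SARIMA": 0.275813,
}

full_r2 = r2_by_model["Wavelet_CNN_SARIMA_GRU_STTransformer"]

# Negative delta means the variant explains less variance than the full model.
delta_r2 = {name: r2 - full_r2 for name, r2 in r2_by_model.items()}

for name, delta in sorted(delta_r2.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {delta:+.6f}")
```

Read this way, removing the wavelet input (NoWavelet_CNN_SARIMA_GRU_STTransformer) lowers R² by about 0.31, while the Wavelet_CNN_GRU variant stays within about 0.005 of the full model.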

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Akanova, A.; Ospanova, N.; Muratova, G.; Sharipova, S.; Tokzhigitova, N.; Anarbekova, G. A Hybrid Algorithm Combining Wavelet Analysis and Deep Learning for Predicting Agroclimatic Pest Infestations. Algorithms 2026, 19, 242. https://doi.org/10.3390/a19030242

