1. Introduction
Wheat is one of the world’s three primary cereal grains, and it plays a decisive role in ensuring global food security and stabilizing socio-economic development. As a staple crop providing the largest share of global caloric intake, wheat production contributes significantly to the world’s food supply and nutrition structure [1,2,3,4]. Accurate and timely wheat yield prediction is therefore essential for preventing food shortages, optimizing agricultural input allocation, and supporting national macro-level policy-making [5,6]. China, the world’s largest wheat producer, accounts for 11.26% of the total global wheat planting area and 17.98% of global wheat output [7]. Among its production regions, the winter wheat belt between the Yellow River and the Huai River is dominant, contributing over 85% of the country’s total summer grain yield [8]. Consequently, improving the precision of wheat yield forecasting is of strategic importance for national food security and global agricultural stability.
Wheat yield is affected by a complex set of interacting elements, which can be categorized into direct and indirect influencing factors. Direct factors refer to climatic and environmental variables that directly affect crop growth physiology, including average temperature and precipitation. Indirect factors, on the other hand, shape farmers’ production decisions, resource inputs, and resilience to environmental variability. These include indicators such as the total power of agricultural machinery, the total agricultural output value, the comprehensive output of agriculture–forestry–animal husbandry–fishery sectors, the total grain output, the cultivated land area, and disaster-affected areas. Although climate factors have traditionally been considered the dominant predictors of crop yields, modern agricultural mechanization has increasingly enabled farmers to buffer or offset adverse weather impacts through enhanced operational efficiency and improved management practices [
9,
10,
11]. Our empirical findings further reveal that, for the studied regions, the average temperature and precipitation are no longer the most influential factors. Instead, several indirect socio-economic indicators show a stronger association with wheat yields. This highlights the necessity of incorporating both direct and indirect factors into prediction models—an important aspect that is largely neglected in existing crop yield forecasting studies.
A substantial body of research has explored machine learning and deep learning techniques for crop yield prediction. Traditional statistical models—such as linear regression, ARIMA, support vector regression, and decision-tree–based ensembles—provide basic predictive capabilities but often struggle to represent nonlinear multivariate time-series relationships within agricultural systems [
12]. With the rise of deep learning, various models including convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and hybrid architectures with attention mechanisms have demonstrated improved performance in capturing complex spatiotemporal patterns [
6,
12,
13,
14,
15,
16]. Recent studies have applied deep neural networks to integrate climatic, ecological, and remote sensing features in order to enhance model accuracy [
17,
18]. Meanwhile, attention mechanisms have been introduced to identify key temporal features and enhance interpretability in yield prediction tasks [
16]. Despite these advancements, two critical challenges persist:
(1) The scarcity of high-quality, long-term multivariate agricultural time-series data in many regions.
(2) The lack of explicit consideration for indirect socio-economic and policy-related factors, which play increasingly significant roles under modern agricultural conditions.
To address these limitations, this paper proposes a Transfer-learning-based Parallel CNN-LSTM-Attention (TPCLA) model that integrates both direct and indirect yield-influencing variables. A cross-regional transfer learning strategy is employed to alleviate the problem of limited temporal samples by transferring knowledge from regions with similar climatic characteristics [
19]. Meanwhile, the designed parallel CNN-LSTM architecture enables the extraction of spatial features via 1D convolution while simultaneously capturing long-term temporal dependencies through LSTM. An attention mechanism is applied to further highlight the most influential features and enhance model interpretability. Beyond improving predictive accuracy, an important contribution of this work is the explicit incorporation of indirect socio-economic factors—such as the mechanization level and agricultural output indicators—which reveals their substantial influence on wheat yield. This fills an important gap in the existing literature and provides a valuable empirical reference for policy-making and agricultural planning.
The main contributions of this paper are as follows.
(1) A cross-regional transfer learning strategy is introduced to overcome the scarcity of wheat yield time-series data, enabling more robust temporal feature extraction.
(2) A parallel CNN-LSTM-Attention network is designed to preserve spatiotemporal feature integrity, enhance data utilization, and improve prediction performance.
(3) Both direct and indirect yield-influencing factors are incorporated, demonstrating that indirect socio-economic variables significantly contribute to wheat yield prediction and filling a research gap in existing crop forecasting methodologies.
(4) Extensive experiments on multivariate wheat time-series data from 1993 to 2023 validate the accuracy, stability, and generalization capability of the proposed TPCLA model.
2. Related Works
In recent decades, many researchers have increasingly focused on improving crop yield prediction through different methods, including empirical statistical models, process-oriented crop growth models, and prediction through remote sensing data [
5,
20,
21]. Traditional statistical models predict yields by establishing regression equations between weather variables (such as temperature, precipitation, solar radiation, etc.) and the yields measured at different time and spatial scales [
4,
22,
23].
Process-based crop simulation models present a mechanistic alternative to purely statistical approaches for yield prediction. Models such as the Decision Support System for Agrotechnology Transfer (DSSAT) [
24], the Agricultural Production Systems Simulator (APSIM) [
25], and the World Food Studies (WOFOST) [
8] simulate crop growth and development by modeling underlying biophysical processes (e.g., photosynthesis, phenology, and soil–water dynamics). Their primary strength lies in their strong interpretability and the ability to conduct “what-if” scenario analyses under changing environmental conditions [
21,
26]. However, a significant limitation hindering their widespread operational application is their dependency on extensive, high-quality input parameters—including detailed soil profiles, cultivar-specific genetic coefficients, and precise daily management data—which are often difficult or costly to obtain at regional scales [
27]. This data requirement challenge, coupled with the computational complexity of running these models, has motivated the exploration of data-driven machine learning approaches that can learn complex, non-linear relationships directly from more readily available historical data, thus providing a complementary and often more practical pathway for large-area yield forecasting [
28].
Substantial research efforts have been devoted to developing models for wheat yield forecasting, particularly through the utilization of multivariate time-series data. Earlier studies relied predominantly on linear statistical techniques to capture temporal yield variations. However, with advances in computing capabilities, machine learning and deep learning methods have increasingly been adopted across diverse domains—including image analysis, language processing, and signal interpretation [
27]. Classical machine learning algorithms such as support vector machines (SVMs) and random forests (RFs) have been extensively applied in remote sensing–related tasks [
28,
29] and in agricultural yield prediction [
30,
31,
32]. Building on these developments, a range of machine learning, deep learning, and hybrid frameworks have reported improved performance for wheat yield estimation.
Alongside these approaches, statistical analysis methods have continued to evolve. For example, Niedbała G. [
33] proposed a multi-head linear regression framework for wheat prediction, while Amin et al. [
34] demonstrated the effectiveness of the AutoRegressive Integrated Moving Average (ARIMA) model for forecasting wheat yields. Nevertheless, methods grounded solely in statistical assumptions often struggle to represent complex nonlinear dependencies, and their predictive capability may degrade when dealing with extended time horizons or numerous interacting variables.
Recent years have seen rapid advances in machine learning and deep learning, leading to their widespread use in various predictive modeling tasks. Sequence-based architectures, particularly RNN and LSTM models, have gained prominence due to their ability to represent temporal dynamics. Comparative evaluations of multiple machine learning models, including LSTM, have shown that both linear regression and LSTM approaches can produce competitive results in agricultural yield forecasting [
35]. Enhanced variants of LSTM have also been introduced, such as the Deep LSTM (DLSTM) model, which demonstrated improved accuracy for production-related time-series problems [
36], and an LSTM optimized using the Improved Optimization Framework (IOF), which further strengthened yield prediction performance [
37]. Additional hybrid designs have been proposed, including an ARIMA–LSTM combination for wheat forecasting [
38] and a CNN–LSTM architecture augmented with multi-head attention and skip connections [
39].
Although the aforementioned deep learning models have achieved promising results in capturing temporal patterns and improving predictive performance, most existing architectures rely primarily on climatic variables or single-dimensional temporal features. For instance, LSTM variants and their optimized forms [
36,
37,
40] exhibit strong nonlinear modeling capability, yet their performance may degrade in data-scarce scenarios due to the absence of mechanisms for parameter reuse across regions. Similarly, hybrid ARIMA–LSTM approaches [
41] and CNN–LSTM-based architectures with attention and skip connections [
39] effectively extract spatiotemporal features but often require sufficiently large datasets to fully exploit their representational capacity. In contrast, the transfer learning strategy adopted in the present study aims to alleviate limitations associated with small-sample datasets by allowing the model to reuse knowledge from ecologically similar regions, thereby improving generalization performance without depending solely on increasing data volume.
Furthermore, recent studies have incorporated attention mechanisms into neural architectures to emphasize critical temporal features and enhance interpretability, as demonstrated in attention-augmented LSTM and CNN–LSTM models [
39]. While such methods strengthen feature discrimination, they predominantly focus on direct climatic variables and rarely integrate indirect socio-economic indicators that also influence yield fluctuations. The approach proposed in this study complements existing research by jointly modeling direct meteorological factors and indirect agricultural variables within a parallel spatiotemporal architecture. The inclusion of cross-regional transfer learning further enables the model to capture invariant patterns that conventional deep learning models [
36,
37,
40] may overlook. Rather than replacing prior architectures, the present framework extends their applicability to small-sample, multi-factor agricultural prediction scenarios.
4. Experiments
4.1. Wheat Dataset Preprocessing
The wheat production data are sourced from government statistical yearbooks of multiple provinces in China from 1993 to 2024. As official datasets, they were treated as reliable, and no outlier handling was applied. The dataset includes eight features: rainfall; the average temperature; the total cultivated land area; the total grain output; the total output value of agriculture, forestry, animal husbandry, and fishery; the total agricultural output value; the total power of agricultural machinery; and the disaster-affected area. The target variable is the wheat yield per unit area. The data from 1993 to 2012 were used as training data, the data from 2013 to 2016 as validation data, and the data from 2017 to 2024 as test data.
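The chronological split described above can be sketched as follows. This is a minimal illustration, assuming the yearly records are held in a dictionary keyed by year; the function name `split_by_year` and the empty placeholder rows are hypothetical and not taken from the paper.

```python
# Chronological split of yearly records, following the year ranges in the text.
# The record layout (a dict keyed by year) is an illustrative assumption.

def split_by_year(records, train_end=2012, val_end=2016):
    """Split {year: row} into train/validation/test dicts without shuffling."""
    train = {y: r for y, r in records.items() if y <= train_end}
    val = {y: r for y, r in records.items() if train_end < y <= val_end}
    test = {y: r for y, r in records.items() if y > val_end}
    return train, val, test

# Placeholder rows: one empty feature dict per year from 1993 to 2024.
records = {year: {} for year in range(1993, 2025)}
train, val, test = split_by_year(records)
print(len(train), len(val), len(test))  # prints: 20 4 8
```

Splitting by year rather than at random keeps every test year strictly later than every training year, which avoids leaking future information into the fit.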
4.2. Feature Analysis
A feature analysis was conducted on the above-mentioned feature data to obtain the influence weights of each feature on wheat yields. This paper adopts the Pearson correlation coefficient and the Spearman correlation coefficient for the correlation analysis of each feature. The calculation formulas are as follows, respectively:
$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where $r$ and $\rho$ are the Pearson correlation coefficient and the Spearman correlation coefficient, respectively; $X$ and $Y$ are the feature and the target feature, respectively; $\operatorname{cov}(X, Y)$ is the covariance of $X$ and $Y$; $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$; $n$ is the number of data points; and $d_i$ is the rank difference between $X_i$ and $Y_i$.
As shown in
Table 2, both Pearson and Spearman correlation coefficients indicate that the total power of agricultural machinery exhibits the highest correlation, suggesting that it has the most substantial impact on wheat yields, followed by the gross agricultural output value. Furthermore, the affected area demonstrates a negative correlation with wheat yields, whereas average temperature and precipitation show the least influence.
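For readers who wish to reproduce this kind of analysis, the two coefficients can be computed as in the sketch below. It uses the standard covariance-over-standard-deviations form for Pearson and the rank-difference form for Spearman, which is exact only when there are no tied values. The sample numbers are made up for illustration and are not the data behind Table 2.

```python
import math

def pearson(x, y):
    """Pearson coefficient: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks in ascending order (assumes no tied values)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman coefficient via the rank-difference formula (no ties assumed)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical values: a machinery-power-like feature and a yield-like target.
feature = [10.0, 20.0, 30.0, 40.0, 50.0]
target = [4.1, 4.3, 4.2, 4.6, 4.8]
print(round(pearson(feature, target), 3), round(spearman(feature, target), 3))
```

With tied values, the rank-difference formula is only approximate; computing the Pearson coefficient on average ranks is the usual alternative.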
4.3. Wheat Yield Forecasting Performance Evaluation of the Models
The hyperparameters optimized using the particle swarm algorithm are listed in
Table 3. All models use the Adam optimizer and the MAE as the loss function.
As shown in
Table 4, all models maintain prediction errors below 100 units at the thousand-unit scale while achieving relatively high confidence levels within a 2% confidence interval. These results demonstrate reliable predictive capability despite the inherent complexity of crop yield systems, justifying the selection of these models as benchmarks.
Specifically, the LSTM-Attention model achieves the smallest prediction errors in most years, attaining optimal forecasting accuracy (CD = 0.194) in the 2024 comprehensive evaluation. The traditional RNN exhibits relatively larger errors, whereas the TPCLA model demonstrates the most prominent confidence performance (0.803). Overall, all models effectively predict wheat yields while maintaining high confidence levels, validating the applicability and stability of the selected modeling framework.
As shown in
Figure 4, the predicted wheat yields from all models generally follow the trend of the actual values, though certain years exhibit notable systematic deviations. Specifically, in 2018, all models overestimated the actual yield, with the LSTM-Attention model showing the largest deviation (predicted 4.640 vs. actual 4.492). Conversely, in 2021, all models except PCLA underestimated the actual yield, indicating a shared downward bias.
In terms of long-term performance, the TPCLA model demonstrated the best prediction stability throughout the entire period, with its prediction curve being closest to the actual values. Particularly between 2021 and 2024, the prediction accuracy of TPCLA was significantly superior to other benchmark models. It is noteworthy that the traditional RNN model exhibited large prediction errors in multiple years (e.g., 2022 and 2023), while the LSTM-Attention model tended to consistently overestimate the actual yield.
The phenomenon of multiple models exhibiting unidirectional prediction biases in specific years suggests the potential presence of systematic external influencing factors not captured in the model feature set, such as extreme climate events, agricultural policy adjustments, or global market fluctuations. This observation not only validates the consistency characteristics of the model predictions but also provides important clues for further investigation into key external factors affecting wheat yields.
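One simple way to flag such years is to check whether every model's error has the same sign. The sketch below is an illustrative assumption about how this check could be coded, not the authors' procedure. Only the 2018 pair for LSTM-Attention (4.640 vs. 4.492) comes from the text; all other numbers are placeholders.

```python
def unanimous_bias_years(actual, predictions):
    """Return {year: 'over' | 'under'} for years where all models err in the same direction."""
    flagged = {}
    for year, y_true in actual.items():
        errors = [per_year[year] - y_true for per_year in predictions.values()]
        if all(e > 0 for e in errors):
            flagged[year] = "over"
        elif all(e < 0 for e in errors):
            flagged[year] = "under"
    return flagged

# 2018 actual and the LSTM-Attention 2018 prediction are quoted from the text;
# every other value below is a hypothetical placeholder.
actual = {2018: 4.492, 2019: 4.60}
predictions = {
    "LSTM-Attention": {2018: 4.640, 2019: 4.58},
    "RNN": {2018: 4.55, 2019: 4.66},
}
print(unanimous_bias_years(actual, predictions))  # prints: {2018: 'over'}
```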
As can be seen from
Figure 5, the TPCLA model exhibits a narrower error distribution that is more concentrated near zero with a distinct left-skewed pattern. The PCLA model also demonstrates errors close to zero but with a wider distribution and notable right-skewed characteristics. Although the LSTM-Attention model achieves the narrowest error distribution, its overall errors deviate significantly from zero, indicating larger prediction inaccuracies and consequently higher training errors compared to the former two models. In contrast, both the RNN and LSTM models display wide error distributions with a substantial deviation from zero, resulting in notably inferior prediction accuracy relative to the LSTM-Attention, PCLA, and TPCLA models.
In summary, the TPCLA model optimized through cross-regional transfer learning achieves a compact error distribution centered near zero, effectively enhancing data utilization efficiency. This approach fully leverages the value of limited data samples and addresses data scarcity challenges in wheat yield prediction by enabling extensive feature learning through transfer learning mechanisms.
As shown in Table 5, TPCLA performs best on the test data over the full test period. Compared to the second-best LSTM-Attention model, it achieves an 18.36% reduction in RMSE, a 12.60% decrease in MAE, and a 4.39% improvement in R², effectively validating the feasibility of model optimization through transfer learning on small-sample wheat yield datasets. Furthermore, the LSTM-Attention model without transfer learning still outperforms other benchmark architectures in evaluation metrics, indicating that this attention-enhanced model can further improve prediction accuracy by enhancing data utilization. In conclusion, effective knowledge transfer can be achieved through cross-domain pretraining for parameter initialization and subsequent model fine-tuning. This approach helps overcome data scarcity constraints in target regions while improving prediction accuracy and fitting performance, thereby providing a viable methodology for accurate and efficient crop yield prediction.
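For reference, the three evaluation metrics and the relative reduction used in such comparisons can be computed as in the following sketch. The standard textbook definitions are assumed, and the sample values are illustrative rather than taken from Table 5.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100 * (baseline - improved) / baseline

# Hypothetical yields and predictions, for illustration only.
y_true = [4.0, 4.5, 5.0]
y_pred = [4.1, 4.4, 5.2]
print(round(rmse(y_true, y_pred), 4), round(mae(y_true, y_pred), 4), round(r2(y_true, y_pred), 4))
```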
5. Discussion
Our experimental results demonstrate the effectiveness of the proposed TPCLA model in predicting wheat yields using limited time-series data. A particularly noteworthy finding emerged from the systematic analysis of prediction errors across multiple models. As illustrated in
Figure 4, all models consistently overestimated the actual yield in 2018, while all models except PCLA underestimated it in 2021. This consistent directional bias across diverse architectures strongly suggests the influence of external drivers not captured in the feature set, rather than model failure. In fact, this sensitivity underscores the model’s capacity to detect unquantified external shocks.
In
Table 2, a further feature analysis identified total power of agricultural machinery as the most influential positive factor, while disaster-affected area exhibited a negative correlation. This provides a plausible explanation for the observed biases: the overestimation in 2018 may be attributed to unrecorded extreme climate events (e.g., regional drought) that reduced actual yields beyond model expectations, whereas the underestimation in 2021 could reflect the impact of a potent yield-enhancing policy (e.g., the promotion of new cultivars or temporary subsidies), whose effect exceeded projections based on historical data alone.
This insight elevates the role of our model from a mere forecasting tool to a diagnostic instrument, capable of retrospectively revealing significant external factors—such as policy efficacy or extreme weather impacts—that are otherwise poorly documented. Such a capability offers quantitative support for governmental evaluation of agricultural policies and the development of risk mitigation strategies.
In terms of predictive performance, the TPCLA model achieved the best results across all core metrics (RMSE, MAE, and R²). It reduced RMSE and MAE by 18.36% and 12.60%, respectively, compared to the second-best LSTM-Attention model, while raising R² to 0.904. This marked improvement supports the efficacy of cross-regional transfer learning, which enables the model to learn common yield-influencing patterns (e.g., climatic trends and cropping systems) during pre-training, and subsequently adapt to the local characteristics of the target region during fine-tuning. This approach mitigates overfitting issues common in small-sample scenarios and enhances the generalization capability.
5.1. Comparative Advantages over Biophysical Models
When compared to traditional biophysical crop models, the TPCLA framework exhibits several distinct advantages in the context of regional yield prediction:
(1) Reduced Data Dependency and Enhanced Practicality: Biophysical models require extensive and often unavailable input parameters, such as detailed soil properties, genetic coefficients, and daily management records. In contrast, TPCLA relies solely on publicly available macroscopic indicators (e.g., cultivated area, gross agricultural output, machinery power), significantly lowering data acquisition barriers and enabling scalable and rapid yield estimation.
(2) Integration of Complex System Dynamics and Implicit Knowledge: While biophysical models excel at simulating known physiological processes, they struggle to incorporate socio-economic and human decision factors, such as policy shifts or market responses. TPCLA, as a data-driven approach, automatically learns the composite effects of these factors from historical data. The systematic prediction biases observed in 2018 and 2021 exemplify the model’s ability to internalize the impact of external shocks not explicitly included in the feature set.
(3) Computational Efficiency and Rapid Deployment: Biophysical models are computationally intensive, often requiring complex simulations at fine spatial resolutions. Once trained, TPCLA performs predictions via a single forward pass, allowing frequent and timely updates—a critical feature for supporting real-time agricultural decision-making.
(4) Generalization via Knowledge Transfer in Small-Sample Settings: A key contribution of this study is the use of cross-regional transfer learning to overcome data scarcity. By pre-training on data from agronomically similar regions, the model captures universal spatio-temporal patterns before fine-tuning on the target region (Shandong). This “learn-and-adapt” strategy effectively mitigates overfitting and yields more robust predictions on limited local data (31 years), as validated through the test results.
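The “learn-and-adapt” idea in item (4) can be illustrated with a deliberately small toy example. The sketch below replaces the neural network with a one-variable linear model, which is an assumption made purely for brevity; the function names, learning rate, epoch counts, and synthetic data are all hypothetical and unrelated to the paper's experiments.

```python
def train_linear(xs, ys, w=0.0, b=0.0, lr=0.05, epochs=200):
    """Fit y ~ w*x + b by full-batch gradient descent on mean squared error."""
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(xs, ys, w, b):
    """Mean squared error of the linear model on (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic "source region": many samples from y = 2.0*x + 1.0.
src_x = [i / 10 for i in range(20)]
src_y = [2.0 * x + 1.0 for x in src_x]

# Synthetic "target region": few samples from a similar relation, y = 2.1*x + 1.2.
tgt_x = [0.2, 0.8, 1.4]
tgt_y = [2.1 * x + 1.2 for x in tgt_x]

# Pre-train on the source, then fine-tune briefly on the target.
w0, b0 = train_linear(src_x, src_y, epochs=1000)
w_ft, b_ft = train_linear(tgt_x, tgt_y, w=w0, b=b0, epochs=5)

# Baseline: the same short training budget on the target, starting from zeros.
w_sc, b_sc = train_linear(tgt_x, tgt_y, epochs=5)

finetuned_mse = mse(tgt_x, tgt_y, w_ft, b_ft)
scratch_mse = mse(tgt_x, tgt_y, w_sc, b_sc)
print(finetuned_mse < scratch_mse)  # prints: True
```

Because the source and target relations are similar, the pre-trained parameters already sit close to the target optimum, so a few fine-tuning steps suffice where training from scratch does not.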
5.2. Rationale for Feature Inclusion and Architectural Design
The inclusion of the Gross Output Value of Agriculture, Forestry, Animal Husbandry, and Fishery is statistically justified due to its strong correlation with wheat yield (Pearson = 0.795, Spearman = 0.734). This variable serves as a proxy for regional agricultural development, indirectly capturing the effects of technological progress, capital investment, and policy support—factors that are difficult to quantify directly but critically influence yield outcomes.
In the parallel CNN-LSTM-Attention architecture, the 1D-CNN branch is employed to capture short-term local temporal patterns within multi-year windows. This design is motivated by the recognition that crop yields are often influenced by sequential conditions over consecutive years (e.g., sustained investment in agricultural machinery). While LSTM models long-term temporal dependencies, the CNN complements it by detecting localized, multi-year interactions that may signify critical preparatory phases for high yields. This parallel setup allows the model to leverage both short-term fluctuations and long-term trends, enhancing its capacity to represent complex agricultural systems without manual feature engineering.
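To make the role of the two non-recurrent components concrete, the sketch below shows a one-dimensional convolution over a multi-year window and a softmax-weighted (attention-style) pooling step in plain Python. It is a conceptual illustration under simplifying assumptions (a single channel, a hand-picked kernel, and externally supplied attention scores) and does not reproduce the layers of the proposed network.

```python
import math

def conv1d_valid(series, kernel):
    """Slide `kernel` over `series` without padding or kernel flipping.

    Output length is len(series) - len(kernel) + 1.
    """
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]

def attention_pool(features, scores):
    """Softmax the scores into weights and return (weighted sum, weights)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * f for w, f in zip(weights, features)), weights

# Hypothetical yearly machinery-power series; the kernel measures the change
# between the first and last year of each three-year window.
machinery = [1.0, 2.0, 4.0, 7.0, 11.0]
local = conv1d_valid(machinery, [-1.0, 0.0, 1.0])
print(local)  # prints: [3.0, 5.0, 7.0]

# Equal scores reduce attention pooling to a plain average of the features.
pooled, weights = attention_pool(local, [0.0, 0.0, 0.0])
print(round(pooled, 6))
```

In a trained network the kernel values and attention scores are learned rather than fixed, but the arithmetic applied to each window is the same.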
6. Conclusions
This study investigates the effectiveness of transfer learning in improving wheat yield prediction under small-sample conditions. To address the challenges posed by limited data availability and complex multivariate dependencies, a Transfer-learning-based Parallel CNN–LSTM–Attention (TPCLA) model is proposed. By integrating cross-regional transfer learning with a parallel spatiotemporal feature extraction framework, the model enhances data utilization and effectively captures invariant yield-related patterns.
Comparative experiments among five deep learning architectures—RNN, LSTM, LSTM–Attention, PCLA, and TPCLA—demonstrate that TPCLA consistently achieves the highest accuracy and robustness across all evaluation metrics. The results confirm that transfer learning can mitigate the effects of data scarcity and improve model generalization, especially when yield is influenced by both direct climatic variables and indirect socio-economic factors. An analysis of prediction residuals further indicates that unobserved external influences, such as weather anomalies, policy interventions, and market fluctuations, contribute to systematic deviations. These findings highlight the practical value of TPCLA for supporting agricultural planning and policy formulation in data-constrained settings.
Future Work: Although the incorporation of indirect socio-economic indicators improves prediction performance, some variables, such as the total agricultural output value, naturally exhibit upward trends due to macroeconomic growth, inflation, or rising GDP. This may introduce spurious correlations or inflated importance in the model, especially when the true crop yield remains stable over time. Addressing this limitation represents an important direction for future research. Potential solutions include the following: (1) de-trending or economically normalizing long-term socio-economic indicators; (2) employing causality-aware feature selection or structural causal models to disentangle genuine yield determinants from confounding macroeconomic trends; (3) designing trend-robust architectures that explicitly separate short-term agronomic signals from long-term economic drift.
Future work will explore these approaches to further enhance model interpretability and prevent biased correlations that can arise from inherently trending variables.
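As an indication of what the first option could look like, the sketch below removes a least-squares linear trend from a yearly indicator. Linear de-trending is only one possible choice and is assumed here for illustration; the sample series is hypothetical.

```python
def linear_detrend(values):
    """Subtract the least-squares line fitted against the index 0..n-1 (needs n >= 2)."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
             / sum((t - t_mean) ** 2 for t in range(n)))
    intercept = v_mean - slope * t_mean
    return [v - (intercept + slope * t) for t, v in enumerate(values)]

# Hypothetical output-value series with a strong upward drift.
output_value = [10.0, 12.0, 15.0, 16.0, 20.0]
residuals = linear_detrend(output_value)
print([round(r, 2) for r in residuals])
```

The residuals keep the year-to-year fluctuations while discarding the shared upward drift, so a correlation computed on them is less likely to reflect macroeconomic growth alone.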