1. Introduction
Yunnan lies on a low-latitude plateau and has complex terrain characterized by steep mountains, deep valleys, criss-crossing gullies, and a dense network of rivers. The province is lower in the south and higher in the north, with an elevation difference of 6600 m. Because mountainous and hilly areas dominate and flat land suitable for cultivation is scarce, arable land is limited, and the overall cultivation index is only about 12%. Moreover, most of Yunnan’s important arable land is mountainous or hilly: only 10% is relatively flat, and the remaining areas feature significant topographical variations that decrease from west to south [
1]. Based on these topographical conditions, rice cultivation in Yunnan is predominantly mountainous, often resulting in small-scale rice fields and multiple rice fields within the same area. Some regions primarily utilize terraced fields for cultivation [
2]. Notably, in rice breeding, small experimental plots in greenhouses also serve as core planting areas. Yield accuracy there directly determines how efficiently superior varieties can be selected, yet this scenario has long been overlooked in yield prediction studies. Current rice yield research focuses mainly on large-scale estimation, and very few studies target precise yield prediction for small-scale cultivation. In addition, the plateau terrain of Yunnan and the greenhouse microenvironment differ considerably from open-field conditions, which makes field image acquisition more difficult. Precise yield prediction for small-scale rice cultivation would therefore address a key difficulty of rice yield prediction in Yunnan.
Rice stands as one of the most significant food crops globally, ensuring the survival of over 50% of the world’s population and ranking as one of the most crucial field crops [
3]. Historically, rice yield prediction has relied mainly on technologies such as geographic information systems for large-scale estimation. This approach is suited to vast areas but is time-consuming and has relatively low prediction accuracy. In recent years, with the rapid development of artificial intelligence, building rice yield prediction models with machine learning and deep learning algorithms has become a clear technological trend [
4]. Notably, deep learning, one of the most effective approaches to crop phenotype prediction, has become a research focus in the field of smart agriculture [
5]. However, rice planting environments are complex and diverse. The topographic complexity and climate variability of Yunnan’s plateau open fields, the small plots and varietal micro-environmental differences in greenhouse breeding, and the unpredictability of climate change all reduce the accuracy of deep learning models. An urgent task in smart agriculture is therefore to integrate deep learning with other technologies to build a model that can precisely predict rice yields in small-scale planting areas.
In recent years, the robust image processing capabilities of deep learning have emerged as a pivotal aspect in the field of smart agriculture, facilitating the identification of pests and diseases, yield prediction, quality assessment, and variety classification in crop phenotypic monitoring [
6]. A novel pathway for constructing intelligent models based on crop phenotyping has gradually emerged. Such deep learning-based models have been widely applied in crop yield prediction research: Jeong, Seungtaek et al. [
7] integrated deep learning, remote sensing, and process-based crop models to develop a rice yield prediction model, enhancing the accuracy of rice yield prediction. Tsouli Fathi et al. [
8] compared DNN with existing ML techniques, formulating a deep learning model for crop yield prediction based on agricultural chemical and climatic data in the Mediterranean region, achieving a low prediction error. Arun Kumar Sangaiah et al. [
9] proposed the SmartAgri-Net (SA-Net) decision support system and recommended support for designing and deploying SA-Net for smartphone applications. Bhojani and Bhatt [
10] introduced a novel activation function to enhance the DNN algorithm, constructing a wheat yield prediction model. Yang et al. [
11] compared RGB-based CNN models with traditional regression models, proposing an improved CNN model to predict rice grain yield. Alkha Mohan et al. proposed Temporal Convolutional Networks (TCNs) with specially designed dilated convolutional modules for predicting rice crop yield from vegetation indices and climatic parameters [
12]. Chen et al. [
13], Terliksiz and Altılar [
14], Shidnal et al. [
15], Yalcin [
16], and Kang et al. [
17] have also used CNNs for crop yield prediction.
Drones equipped with multispectral cameras are capable of capturing spectral information in the green, red, red-edge, and near-infrared bands in a timely manner, providing crucial data support for the precise analysis of vegetation index characteristics [
18]. The use of multispectral data to calculate vegetation indices for estimating crop yields has significant limitations [
19,
20]. Predictions are primarily based on features such as crop vegetation indices [
21,
22,
23,
24] and crop height [
25], using simple statistical models such as linear regression. For instance, Maimaitijiang et al. [
26], Wan et al. [
27], and Zhou et al. [
20] incorporated texture, canopy temperature, and leaf area index into the construction of crop yield prediction models, enhancing the accuracy of these models. However, as crops mature, a decline in the predictive power of these models is inevitable. To address this issue, with the increasing application of artificial intelligence technology in the agricultural sector, some researchers utilize drones equipped with multispectral cameras to capture features from multispectral images. By integrating machine learning and deep learning, they construct crop yield prediction models [
20,
26,
27,
28,
29,
30,
31]. This approach avoids the shortcomings of manual feature selection, as intelligent models possess the capability for autonomous learning, simplifying the modeling process [
32,
33]. Additionally, there is significant potential to enhance the accuracy of crop yield prediction by leveraging multispectral image information and spatial features [
11,
28,
34]. However, applying deep learning to crop yield estimation requires designing and optimizing the network structure, training parameters, and learning strategy for the specific crop characteristics and planting environment, and the resulting models remain weakly interpretable. Attention mechanisms can be incorporated to extract key information and optimize the model, thereby enhancing its interpretability [
35]. For instance, Tian et al. [
36] proposed an LSTM neural network algorithm incorporating attention mechanisms to construct a wheat yield estimation model. Nevertheless, few researchers have combined deep learning algorithms with multispectral technology to build precise yield prediction models for small-scale planting areas in highland regions.
Traditional rice yield estimation methods primarily focus on large-scale production and lack accurate prediction solutions for small-scale plateau planting (e.g., in Yunnan) and breeding scenarios. To address this issue, this study aims to propose a WT-CNN-BiLSTM hybrid model for improving the accuracy of small-scale rice yield prediction under complex plateau environments and breeding contexts. The specific objectives are as follows:
- (1) Collect UAV-borne multispectral images (covering green, red, red-edge, and near-infrared bands) of rice throughout its growth cycle under different drip irrigation levels and measure yield data of small plots to construct a dataset;
- (2) With CNN-LSTM as the baseline, compare common vegetation indices and screen out the optimal one for characterizing rice growth dynamics;
- (3) Develop a model integrating WTConv, ResNet50, and BiLSTM, replacing the convolutional layers in the residual blocks of ResNet50 with WTConv for multi-frequency feature extraction and using BiLSTM to capture the long-term growth trends of rice;
- (4) Analyze the model’s performance under different irrigation levels to verify its adaptability.
The technical route of this study is illustrated in
Figure 1. This method is expected to solve the challenges of small-scale plateau rice yield prediction and support accurate yield assessment in production and breeding.
2. Materials and Methods
2.1. Study Area
The study site for precise small-area rice yield prediction is located within Zone 2–22 of the New Agricultural Science Comprehensive Practical Teaching Base at Yunnan Agricultural University, as depicted in
Figure 2. This experimental area spans approximately 160 m². The soil in this area is red loam, which is the most widespread and dominant soil type in Yunnan. It is characterized by slightly acidic properties and a good water retention-permeability balance, making it well-suited for rice growth. This experimental area primarily cultivates the rice variety Dianheyou 615.
The experimental area was divided into five sub-regions by setting the drip irrigation switches to 100%, 75%, 50%, 25%, and 0%, respectively. These irrigation levels were controlled by a unified switch so that the drip irrigation volume and running time within each sub-region were consistent, eliminating interference caused by inconsistent control conditions. Specifically, throughout the rice growing season, designated personnel operated the drip irrigation switch between 8:00 a.m. and 10:00 a.m. every day, with each irrigation lasting 45 to 60 min so that the target amount for each irrigation level was fully delivered. In addition, during multispectral data collection, switch-controlled adjustable shading devices were used to mitigate sudden strong solar radiation (such as temporary glare caused by changing cloud cover). This kept the solar radiation received by the rice canopy stable across collection dates and avoided interference with the calculation of multispectral indices.
Within the experimental area, multispectral data during the rice growth process was collected using a DJI Mavic3 MRTK multispectral drone, and rice yield data was measured for each sub-zone.
2.2. UAV Multispectral Image Acquisition
The experiment used a DJI Mavic3 multispectral UAV equipped with one visible-light camera (5280 × 3956 resolution) and four multispectral cameras (2592 × 1944 resolution) covering the green (G, 560 nm ± 16 nm), red (R, 650 nm ± 16 nm), red-edge (RE, 730 nm ± 16 nm), and near-infrared (NIR, 860 nm ± 26 nm) bands. The drone was flown manually, which is critical in the small, complex greenhouse environment (e.g., to avoid nearby facilities). Fixed speed and interval settings were therefore not used; the flight path and the number of shots were adjusted flexibly according to site conditions to ensure safe and complete coverage of the plots. Multispectral images were collected at a flight height of 3 m with an overlap of more than 80%, giving a high spatial resolution (about 0.5 cm/pixel) that captures subtle changes in the rice canopy. The 3 m height was chosen in view of the typical greenhouse height range of 3.7–4.3 m, balancing adaptability to various greenhouse conditions against the bending of rice plants caused by low-altitude flight. Together, manual flight and the >80% image overlap provided precise control and dense coverage, optimizing data quality for the 160 m² area under controlled conditions. A total of 47,764 images were collected in a greenhouse at Yunnan Agricultural University, covering 30 dates of the rice growth cycle (
Table 1).
2.3. Data Collection of Rice Yield
The natural environment of the Yunnan plateau region is complex, characterized by significant topographical variations, and the shapes and sizes of the cultivated plots are irregular [
37,
38]. For instance, when cultivating rice in hilly, mountainous, and terraced areas, the cultivated areas are smaller and exhibit notable height differences [
39]. To avoid the edge effect of panicle bending and to ensure a sufficient sample size, the experimental plots were divided into 500 sub-plots (0.5 m × 0.5 m) 1–2 weeks before harvest, in mid-to-late September 2024. This timing minimized interference with growth and allowed each sub-plot to be matched retrospectively to the multispectral images from the 30 time points. The rice was harvested from 26 September to 13 October 2024, and the grain weight of each sub-plot was measured with an electronic scale, as shown in
Table 2. Yields are reported in grams per sub-plot, i.e., g per 0.25 m². For convenience, the unit is written simply as g here.
2.4. Data Processing
Because GPS data were limited and the area is small (160 m²), the greenhouse images could not be orthorectified or stitched into an orthomosaic. However, because the images were acquired by manual flight at a height of 3 m with high overlap (>80%), distortion is minimal and the results remain reliable. From the 47,764 images collected at 30 time points (4 June to 24 September 2024), 80 high-quality multispectral images (20 × 4 bands: G, R, RE, NIR) were manually selected to represent the 500 sub-plots for regression-based yield prediction. These images were registered in ENVI 5.6: the G, RE, and NIR bands were aligned to the R band using tie points and then manually corrected. Python 3.11 (NumPy 1.26.0, GDAL 3.6.2) was used to process each image, calculate the vegetation indices (NDVI, NDRE, OSAVI, RECI), convert each index map into a single-channel grayscale image, and divide it into 224 × 224 pixel blocks. The specific process is shown in
Section 2.5.
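The index calculation, grayscale conversion, and tiling described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the index formulas are the commonly used definitions (the paper's Table 3 is authoritative, and OSAVI in particular has variants), and the function names, the `eps` guard, and the min-max grayscale scaling are assumptions made for the example.

```python
import numpy as np

def vegetation_indices(r, re, nir, eps=1e-6):
    """Per-pixel indices from red, red-edge, and NIR reflectance arrays.

    Formulas are the commonly used definitions (assumed here);
    eps only guards against division by zero.
    """
    r, re, nir = (np.asarray(b, dtype=np.float64) for b in (r, re, nir))
    return {
        "NDVI": (nir - r) / (nir + r + eps),
        "NDRE": (nir - re) / (nir + re + eps),
        "OSAVI": (nir - r) / (nir + r + 0.16),
        "RECI": nir / (re + eps) - 1.0,
    }

def to_grayscale(index_map):
    """Min-max scale an index map to an 8-bit single-channel image (assumed scaling)."""
    lo, hi = float(index_map.min()), float(index_map.max())
    if hi > lo:
        scaled = (index_map - lo) / (hi - lo)
    else:
        scaled = np.zeros_like(index_map, dtype=np.float64)
    return np.round(scaled * 255).astype(np.uint8)

def tile(image, size=224):
    """Split a 2D image into non-overlapping size x size blocks; edge remainders are dropped."""
    rows, cols = image.shape[0] // size, image.shape[1] // size
    return [
        image[i * size:(i + 1) * size, j * size:(j + 1) * size]
        for i in range(rows)
        for j in range(cols)
    ]
```

For example, `tile(to_grayscale(vegetation_indices(r, re, nir)["RECI"]))` would yield the 224 × 224 RECI blocks for one registered image, given band arrays `r`, `re`, and `nir` of equal shape.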
2.5. Dataset Construction
2.5.1. Vegetation Index Selection
To construct a dataset suitable for the deep learning model, we selected four vegetation indices, calculated them for each plot at each time point, and validated their applicability through data-driven analysis (as shown in
Figure 3 and
Figure 4). These vegetation indices serve as indicators of vegetation health, particularly those related to chlorophyll content, biomass, and photosynthetic efficiency. They effectively differentiate between vegetated and non-vegetated areas and facilitate a quantitative assessment of vegetation growth conditions. The introduction of each vegetation index is summarized in
Table 3.
From the trend of the vegetation indices over the whole growth period (
Figure 3), all four indices follow the expected growth rhythm of rice. NDVI and OSAVI show nearly coincident curves, and NDRE and RECI show consistent trends, which confirms their ability to characterize growth dynamics. The correlations between indices at different growth stages (
Figure 4) show that the four vegetation indices are closely related, but the degree of correlation varies by stage. For example, correlations between some indices are relatively high in the rapid growth stage and lower in the mature stage. This indicates that, although the indices are correlated, each captures complementary information about rice growth. The four indices were therefore selected for their established reliability in crop monitoring and for their differing characteristics, which support yield prediction and complement the model development that is the focus of this study.
2.5.2. Dataset Construction Process
Based on these vegetation indices, the multispectral and yield data were collected, and the images were corrected in ENVI 5.6 so that the indices could be calculated at the 30 time points. Python 3.11 and GDAL 3.6.2 were then used for batch spatial correction, which involved: (1) reading the 80 multispectral images; (2) creating ground control points (GCPs) from the manual alignment; (3) applying a polynomial transformation for spatial correction; (4) obtaining corrected four-band images (G, R, RE, NIR); and (5) using GDAL for radiometric correction, with a conversion that normalizes the brightness values. The vegetation index data of the 500 sub-plots were then paired with the corresponding yield values to construct a plot-based vegetation index-yield dataset. After data cleaning and partitioning, augmentation was applied to increase diversity: images were augmented by 90-degree rotation and horizontal/vertical flipping, and Gaussian noise (mean = 0, standard deviation = 5%) was added to the yield values. The resulting dataset comprised 2000 plots, with 60,000 images and 2000 yield data points, and was divided into training, validation, and test sets at a ratio of 7:2:1. The construction process of the dataset is shown in
Figure 5. Examples of raw samples, the final vegetation index-yield dataset (RECI), and the data-enhanced dataset are shown in
Figure 6.
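The augmentation and split can be sketched as below. This is one reading that reproduces the reported counts (500 plots × 4 variants = 2000 plots; 2000 × 30 time points = 60,000 images), not a confirmed description of the authors' code. The function names are hypothetical, and interpreting the 5% standard deviation as relative to each plot's yield is an assumption.

```python
import numpy as np

def augment_plot(seq, yield_g, rng, noise_frac=0.05):
    """Return the original plot plus three augmented copies.

    seq: array of shape (T, H, W), one index image per time point.
    Augmented copies get zero-mean Gaussian noise on the yield, with a
    standard deviation of noise_frac * yield (assumed interpretation).
    """
    variants = [
        seq,                               # original
        np.rot90(seq, k=1, axes=(1, 2)),   # 90-degree rotation
        seq[:, :, ::-1],                   # horizontal flip
        seq[:, ::-1, :],                   # vertical flip
    ]
    samples = []
    for i, images in enumerate(variants):
        noise = 0.0 if i == 0 else rng.normal(0.0, noise_frac * yield_g)
        samples.append((np.ascontiguousarray(images), float(yield_g + noise)))
    return samples

def split_indices(n_samples, rng, ratios=(0.7, 0.2, 0.1)):
    """Shuffle sample indices and split them into train/validation/test sets."""
    order = rng.permutation(n_samples)
    n_train = int(round(n_samples * ratios[0]))
    n_val = int(round(n_samples * ratios[1]))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]
```

With 2000 samples, `split_indices` returns 1400, 400, and 200 indices for the three sets.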
2.6. Model Construction
2.6.1. ResNet50 Model
The primary enhancement of the ResNet (Residual Network) lies in the introduction of residual connections, effectively addressing the issues of gradient vanishing and network degradation encountered during the training process of deep neural networks [
44]. The architecture of ResNet50 is illustrated in
Figure 7a. It comprises multiple residual blocks, with each block consisting of a stack of convolutional (Conv) layers, batch normalization (BN) layers, and rectified linear unit (ReLU) activation functions. In residual networks, the input image initially undergoes a 7 × 7 convolution with a stride of 2, resulting in 64 output channels. Subsequently, batch normalization is performed in the BN layer, followed by a ReLU activation function. This is then processed through a 3 × 3 max pooling layer, reducing the output size to 56 × 56. The residual blocks are constructed using a bottleneck structure, involving a 1 × 1 convolution for dimensionality reduction, a 3 × 3 convolution for feature extraction, and another 1 × 1 convolution for dimensionality expansion. This design reduces the model’s parameters, thereby decreasing computational demands. Within the residual blocks, identity mapping enables gradients to propagate directly from deeper to shallower layers, effectively mitigating the issue of gradient vanishing. The formula for the residual block is presented in Equation (1). Following multiple residual blocks, an activation function is applied, followed by average pooling, and then Softmax is utilized to achieve image classification.
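$$y = F(x, \{W_i\}) + x \tag{1}$$
In this context, $x$ and $y$ denote the input and output of the residual block, $F(x, \{W_i\})$ represents the residual function, and $W_i$ denotes the corresponding weight parameters of the residual function.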
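The feature-map sizes quoted for the ResNet50 stem follow from the usual convolution output-size formula. The short check below assumes a 224 × 224 input and the padding values of the standard ResNet stem (3 for the 7 × 7 convolution, 1 for the 3 × 3 max pooling, with a pooling stride of 2); the text does not state these values, so they are assumptions.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

# 7 x 7 convolution with stride 2 (padding 3 assumed): 224 -> 112
after_conv = conv_out_size(224, kernel=7, stride=2, padding=3)
# 3 x 3 max pooling with stride 2 (padding 1 assumed): 112 -> 56
after_pool = conv_out_size(after_conv, kernel=3, stride=2, padding=1)
```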
Figure 7.
(a) Structural diagrams of residual neural network, (b) LSTM, and (c) BiLSTM.
In this study, the ResNet50 model is employed to extract spatial features from vegetation index images. By modifying the input to a sequence of 30 channels, each containing an image with a resolution of 224 × 224 pixels, an image sequence is formed. Utilizing the deep structure of ResNet50, key spatial features such as texture and shape at each time point of rice growth are extracted. Due to the residual structure’s effectiveness in addressing the issue of gradient vanishing, the CNN model can capture spatial features at different growth stages of rice while preserving time-varying characteristics, providing rich information for subsequent time series analysis.
2.6.2. Long Short-Term Memory (LSTM) Model
Traditional Recurrent Neural Networks (RNNs) are unable to handle long time series due to the issues of gradient vanishing and explosion. The Long Short-Term Memory (LSTM) network addresses these challenges by introducing a gating mechanism. It consists of a cell state and three gating mechanisms (forget gate, input gate, and output gate) [
36,
45]. The forget gate determines which information should be forgotten from the cell state, the input gate decides which new information to store in the cell state, and the output gate controls the flow of information from the cell state to the hidden state. The calculation formula is as follows:
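$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \tag{2}$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{4}$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$
In this context, $i$, $f$, $c$, and $o$, respectively, denote the input gate, forget gate, cell state, and output gate; $x_t$ and $h_t$ are the input and hidden state at time step $t$; $W$ and $b$, respectively, represent the corresponding weights and biases; $\odot$ is the element-wise product; and $\sigma$ and $\tanh$ are the Sigmoid and Tanh activation functions, respectively.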
The training process of LSTM is as follows: First, forward propagation is conducted according to Equations (2) to (6) to calculate the output values of the LSTM units. Subsequently, the error for each LSTM unit is computed through back propagation. Then, the gradients of each weight are calculated based on the corresponding error terms. Finally, the weights are updated using an optimization algorithm. The structure of the LSTM model is illustrated in
Figure 7b.
In practical applications, the parameter configuration of LSTM has a significant impact on model performance. Specifically, the ‘hidden_size’ denotes the dimensionality of the hidden layer, which determines the size of the hidden state and consequently affects the model’s expressive capacity and complexity. A larger ‘hidden_size’ can enhance the model’s fitting ability, but it also increases the computational burden and training time. The ‘num_layers’ indicates the number of stacked LSTM layers, typically set to 1 or 2. Multi-layer LSTM can capture more intricate sequence features, but an excessive number of layers can significantly escalate the computational cost.
In addition to LSTM, another commonly used improved model of recurrent neural networks is the gated recurrent unit (GRU). In practical applications, GRU and LSTM models are the two most frequently utilized recurrent neural network models. The GRU model is a simplified version of the LSTM model, which replaces the input gate, forget gate, and output gate in LSTM with an update gate and reset gate, and merges the cell state and output vector [
46]. This paper employs the LSTM model to further analyze image sequences processed by ResNet50, by learning the temporal features contained therein and combining them with spatial features to enhance the accuracy of yield estimation.
2.6.3. BiLSTM Model
The traditional LSTM model is adept at capturing forward information within sequences, yet it fails to fully utilize backward information. This study introduces the Bidirectional Long-Short Term Memory (BiLSTM) to address the limitations of traditional Long Short-Term Memory (LSTM) in handling complex temporal sequences [
47]. The structure of BiLSTM, as depicted in
Figure 7c, features two input directions: a forward layer and a backward layer. Both layers receive inputs from the input layer, but the data processing direction is reversed [
48]. Through this approach, the model can learn local features at each time step and capture bidirectional characteristics at every point in the sequence, thereby facilitating a more comprehensive understanding of both preceding and succeeding information.
By introducing the BiLSTM model, we overcome the limitations of the unidirectional LSTM model when processing vegetation index sequence images of the rice growth cycle. BiLSTM processes data in both directions, enabling the model to more precisely identify key growth turning points and trends, distinguish between early rapid growth and late maturation stages, and enhance the accuracy of yield estimation.
2.6.4. WTConv Module
WTConv (Wavelet Convolutions) is a convolutional module based on the wavelet transform, as described in [
49]. It utilizes wavelet transformations to expand the receptive field of convolutional neural networks. By performing convolutions with small convolutional kernels on different frequency bands, it enhances the model’s ability to simultaneously focus on low-frequency and high-frequency information, thereby improving its response to shape and texture features in images. By effectively increasing the receptive field of convolution through the utilization of signal processing tools, the model becomes more accurate and efficient in extracting image features.
Commonly used wavelet basis functions in WTConv include the Haar wavelet and the Daubechies wavelet. Wavelet basis functions facilitate multiscale analysis of signals by decomposing the input signal into low-frequency and high-frequency components [
50]. The low-frequency components primarily capture the overall shape and contour of the image, while the high-frequency components are used to extract detail and edge information. In this paper, we employ the first-order Daubechies wavelet (db1), which coincides with the Haar wavelet, as the wavelet basis function for WTConv. The db1 wavelet is computationally efficient and preserves image details and edge information while reducing redundant data through multiscale decomposition [
51]. This article introduces WTConv to enhance the performance of CNN models in extracting image features of rice growth cycles, particularly in the task of rice yield estimation. By augmenting the model’s feature representation capabilities, it enables the model to more effectively learn the temporal and spatial features within vegetation index sequence images.
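To make the frequency decomposition concrete, the sketch below performs a single-level 2D db1 (Haar) transform and its inverse with NumPy. It illustrates only the wavelet step that WTConv builds on; the small-kernel convolutions that WTConv applies to each sub-band are omitted, and the function names and sub-band labels are choices made for this example.

```python
import numpy as np

def haar_dwt2(image):
    """Single-level orthonormal 2D Haar (db1) transform of an even-sized 2D array.

    Returns four half-resolution sub-bands: ll (low-frequency approximation)
    and lh, hl, hh (high-frequency details).
    """
    a = image[0::2, 0::2]  # top-left pixel of each 2 x 2 block
    b = image[0::2, 1::2]  # top-right
    c = image[1::2, 0::2]  # bottom-left
    d = image[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0  # differences between columns
    hl = (a + b - c - d) / 2.0  # differences between rows
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution image from the sub-bands."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out
```

A 224 × 224 index image yields four 112 × 112 sub-bands, so a 3 × 3 kernel applied to a sub-band covers roughly a 6 × 6 region of the original image, which is how the wavelet step enlarges the effective receptive field.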
2.6.5. WT-CNN-BiLSTM Model
Based on the aforementioned theory, this study proposes a composite model that integrates wavelet convolution, residual neural networks, and bidirectional long short-term memory networks. The model structure is illustrated in
Figure 8. The model accepts a sequence of rice growth cycle images containing 30 channels as input and then performs feature extraction on the ResNet50 network. In ResNet50, the convolutional layer in the first residual block is replaced with a WTConv layer to expand the receptive field and extract multi-frequency band features. Subsequently, the final fully connected layer of the residual neural network is removed, and the extracted feature maps are input into the BiLSTM network. This allows for the learning of the dynamic characteristics of these features over time, thereby capturing the long-term trends in rice growth. Finally, the BiLSTM model processes the output through a fully connected layer, ultimately yielding the predicted value of rice yield.
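The temporal part of this pipeline can be sketched in PyTorch as a BiLSTM followed by a fully connected regression layer. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `BiLSTMHead` is hypothetical, the default sizes (2048 input features, 256 hidden units, 2 layers) are taken from the parameter settings reported in this paper, and the way the backbone features are arranged into a sequence of shape `(batch, time, features)` is assumed.

```python
import torch
from torch import nn

class BiLSTMHead(nn.Module):
    """Hypothetical regression head: BiLSTM over a feature sequence, then a linear layer."""

    def __init__(self, feature_dim=2048, hidden_size=256, num_layers=2):
        super().__init__()
        self.bilstm = nn.LSTM(
            input_size=feature_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        # Forward and backward final states are concatenated: 2 * hidden_size.
        self.fc = nn.Linear(2 * hidden_size, 1)

    def forward(self, features):
        # features: (batch, time, feature_dim), assumed to come from the CNN backbone.
        _, (h_n, _) = self.bilstm(features)
        # h_n[-2] and h_n[-1] are the last layer's forward and backward final states.
        summary = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(summary).squeeze(-1)  # one yield prediction per sample
```

Concatenating the two final hidden states gives the regression layer a summary of the sequence read in both directions; using the output at the last time step instead would give the backward direction only a single step of context.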
2.7. Model Parameter Setting
The experimental hardware environment consists of an Intel Core i9-13900K processor and an NVIDIA RTX 2080Ti graphics card, while the software environment uses Python 3.11 and the PyTorch 2.0.1 framework. During the model training phase, the parameter settings were optimized through multiple experiments as follows: the number of training epochs (Epochs) is set to 500, the batch size (Batch Size) is 64, and the initial learning rate is 0.0001. The input and output channels of the CNN section are 30 and 64, respectively, with a convolution kernel size of 7; the WTConv module employs a 3 × 3 convolution kernel based on a one-layer decomposition of the Daubechies wavelet transform. The input feature dimension of the BiLSTM network is 2048, with 2 hidden layers and 256 neurons per layer. The optimizer is Adam, and the loss function is the mean squared error (MSE). To prevent overfitting, model performance is checked on the validation set after each training epoch, and the final model is evaluated on the test set. If the validation loss does not decrease for 50 consecutive epochs, training is stopped early.
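The early-stopping rule can be sketched as a small helper that tracks the best validation loss. The class name `EarlyStopping` and the `min_delta` parameter are hypothetical; only the patience of 50 epochs comes from the settings above.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=50, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # minimum decrease that counts as an improvement
        self.best = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

In a training loop, `step` would be called once per epoch after validation, and the loop would break when it returns `True`.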
2.8. Model Validation
To validate the predictive accuracy of a model, metrics such as Root Mean Square Error (RMSE), Mean Absolute Percentage Error, and Coefficient of Determination are commonly employed to assess model performance. Their respective formulas are presented in Equations (7)–(9).
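$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2} \tag{7}$$
$$\mathrm{MAPE} = \frac{100\%}{m}\sum_{i=1}^{m}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{8}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2} \tag{9}$$
In this context, $m$ denotes the number of samples, $y_i$ represents the actual yield value, $\hat{y}_i$ signifies the predicted yield value, and $\bar{y}$ indicates the average yield value.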
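For reference, the three metrics can be computed with a few lines of plain Python. This is a minimal sketch using the standard definitions of RMSE, MAPE, and the coefficient of determination; the function names are chosen for the example.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error, in the same unit as the yield (g)."""
    m = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / m)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    m = len(y_true)
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / m

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Because MAPE divides by the actual yield, sub-plots with very small yields contribute disproportionately large percentage errors, which is worth keeping in mind when comparing MAPE with RMSE.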
3. Results
3.1. Comparative Experiment on Vegetation Index Performance
To identify the optimal vegetation indices suitable for small-scale yield estimation, this study employed the CNN-LSTM model as the baseline and compared the predictive performance of four vegetation indices: NDVI, NDRE, OSAVI, and RECI, under unified dataset partitioning conditions.
As shown in
Table 4, the RECI vegetation index outperforms the other vegetation indices in terms of Root Mean Squared Error (RMSE) (14.99 g, 13.14 g), Mean Absolute Percentage Error (MAPE) (18.29%, 15.16%), and R² (0.80, 0.86) on both the validation and test sets. This superiority is attributed to RECI’s high sensitivity to rice canopy chlorophyll content, a key factor in yield formation. The loss curves for the training and validation sets in
Figure 9 indicate that the loss for all four vegetation index-yield datasets gradually decreases with increasing epochs. However, the validation loss curves for NDVI, NDRE, and OSAVI oscillate markedly after 400 epochs, suggesting overfitting (NDVI and OSAVI are affected by soil background, and NDRE saturates at high biomass). In contrast, the validation loss curve for the RECI dataset shows no significant oscillation, indicating a more stable training process (RECI adapts to chlorophyll changes during grain filling).
Furthermore, as observed from the scatter plot of yield prediction values and actual values in
Figure 10, within the primary yield range of 0–125 g (more than 90% of the samples), the prediction points of RECI lie closer to the diagonal, reflecting RECI’s better adaptation to the typical yield range of small-scale plots. Although OSAVI shows a local advantage in the high-yield segment above 170 g, the sample size in this range is limited. Overall, RECI demonstrates a more stable predictive capability within the typical small-area yield range, making it more suitable for small-scale rice yield estimation.
3.2. WT-CNN-BiLSTM Model Performance Evaluation
To investigate the impact of bidirectional structures, residual connections, and wavelet transform modules on the predictive performance of rice yield, this study employed CNN-LSTM as the baseline model. On the RECI-Yield dataset, a systematic comparison was conducted between the predictive performance of CNN-BiLSTM, CNN-GRU, and its enhanced version, WT-CNN-BiLSTM. Additionally, ablation experiments were performed focusing on the two key modules: the residual structure and wavelet convolution.
According to
Table 5, the proposed WT-CNN-BiLSTM model reduced the test-set Root Mean Squared Error (RMSE) from 13.14 g for the baseline CNN-LSTM model to 9.68 g, a 26.33% decrease. The Mean Absolute Percentage Error (MAPE) decreased from 15.16% to 11.41%, a 24.74% relative reduction. Additionally, the coefficient of determination (R²) increased from 0.86 to 0.92, an absolute increase of 0.06. These results indicate that the proposed model substantially improves rice yield prediction.
Ablation experiments on the proposed model show the impact of the residual structure and the wavelet convolution module. Adding the residual structure to the CNN-BiLSTM model without it reduced the test-set RMSE from 10.78 g to 10.33 g (a 4.17% reduction), reduced the MAPE from 11.94% to 11.67% (a 2.26% relative reduction), and raised R² from 0.90 to 0.91 (an increase of 0.01). Adding the wavelet convolution module alone reduced the RMSE from 10.78 g to 9.96 g (7.61%), reduced the MAPE from 11.94% to 11.63% (2.60%), and raised R² from 0.90 to 0.92 (an increase of 0.02). With both modules, the proposed WT-CNN-BiLSTM model performed best: relative to the CNN-LSTM baseline, the test-set RMSE fell from 13.14 g to 9.68 g (26.33%), the MAPE fell from 15.16% to 11.41% (24.74%), and R² rose from 0.86 to 0.92 (an increase of 0.06). This synergy makes yield estimation more robust, which is critical for optimizing irrigation and inputs in small plots.
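The percentage figures in this comparison are relative changes with respect to the starting value. The short check below recomputes them from the reported RMSE and MAPE values; the helper name `relative_reduction` is chosen for the example.

```python
def relative_reduction(before, after):
    """Percentage reduction of `after` relative to `before`."""
    return (before - after) / before * 100.0

# Baseline CNN-LSTM versus WT-CNN-BiLSTM on the test set
rmse_gain = round(relative_reduction(13.14, 9.68), 2)   # RMSE in g
mape_gain = round(relative_reduction(15.16, 11.41), 2)  # MAPE in %

# Ablation steps starting from CNN-BiLSTM without the residual structure
residual_gain = round(relative_reduction(10.78, 10.33), 2)
wavelet_gain = round(relative_reduction(10.78, 9.96), 2)
```

The same formula applied to R² (0.86 to 0.92) gives a relative gain of about 7%, so the R² changes are best read as absolute differences (0.06 here).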
Based on the accuracy curve depicted in Figure 11 and the loss curve illustrated in Figure 12, the proposed WT-CNN-BiLSTM model demonstrates superior performance. It maintains a high R² value throughout the later training epochs and overfits less than the models lacking the residual structure or the wavelet convolution. In addition, the scatter plot in Figure 13 shows that removing the residual structure from CNN-BiLSTM (CNN-BiLSTM-NoResidual) increased the number of high-error points within the yield range of 75–125 g, whereas incorporating wavelet convolution on this basis (WT-CNN-BiLSTM-NoResidual) reduced the number of such points. This further underscores the importance of the residual structure and wavelet convolution modules for the predictive accuracy of the model.
To further validate the generalization capability of the proposed model, a new dataset, RECI-Yield-VT, was constructed from the validation and test sets, which were not involved in training. This dataset was divided into a training set and a validation set in a 4:1 ratio, and each model was evaluated on it by 5-fold cross-validation to assess its performance under different data distributions. To shorten validation and improve experimental efficiency, the best-performing trained model was used as the starting point for transfer learning on the new dataset. Each fold was trained for 10 epochs, and the model parameters remained consistent with those used in the previous training.
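In 5-fold cross-validation each fold holds out one fifth of the samples, which matches the 4:1 training-to-validation ratio described above. The sketch below shows one way such splits could be generated; the shuffling, the seed, and the sample count of 100 are assumptions made for illustration, and the fine-tuning of the pretrained model on each fold is not shown.

```python
import random

def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

# Hypothetical sample count; the actual size of RECI-Yield-VT is not stated here.
splits = list(k_fold_indices(n_samples=100, k=5))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```

Every sample appears in exactly one validation fold, so the reported cross-validation metrics would be averages over five disjoint held-out subsets.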
Based on the results presented in Table 6, the WT-CNN-BiLSTM model performed best on the validation set, achieving an RMSE of 8.07 g, a 13.55% reduction compared with the baseline CNN-LSTM model. The MAPE and R² metrics also improved markedly. Compared with the model with neither residual connections nor wavelet convolution (CNN-BiLSTM-NoResidual) and the model with wavelet convolution but no residual connections (WT-CNN-BiLSTM-NoResidual), the validation RMSE of WT-CNN-BiLSTM was lower by 9.12% and 5.94%, respectively. These findings further support the pivotal role of residual connections and wavelet convolution modules in enhancing the precision of small-scale yield prediction.
3.3. The Impact of Irrigation Levels on Model Performance
To investigate the impact of irrigation level on the accuracy and reliability of rice yield prediction, five sub-test sets were derived from the test set based on the previously delineated sub-regions. The model’s predictive performance was evaluated on each of these five independent sub-test sets and compared across irrigation levels. The results for the original complete test set, reported as “All irrigation level”, were also included as an overall reference. The experimental results are presented in Table 7.
Table 7 reveals a distinct nonlinear relationship between irrigation level and rice yield. Yield neither increases nor decreases monotonically as irrigation rises; instead, there is an optimal irrigation range. Specifically, yield peaks at the 50% irrigation level (7329.16 g), whereas both the highest (100%) and lowest (0%) levels markedly reduce it. This pattern suggests that excessive irrigation may cause soil waterlogging, impairing root development and nutrient absorption, while low irrigation directly constrains the water supply to rice and inhibits its growth potential. The “All irrigation level” category, with a total yield of 29,038.46 g, is the aggregated yield of the original test set across all irrigation scenarios and provides an overview of yield across the full irrigation spectrum.
The predictive performance of the WT-CNN-BiLSTM model differs across irrigation levels. The model achieves high prediction accuracy at irrigation levels of 50% and 25%, with a test-set R² of 0.91 for both, indicating that it captures yield variation trends well under these conditions. Accuracy is slightly lower at irrigation levels of 100% and 0%, with test-set R² values of 0.89 and 0.90, respectively. This difference may be attributed to the limited sample size at the extreme irrigation levels, which can make the model’s performance less stable under these conditions.
Additionally, among the single irrigation levels, the model is least accurate at 75% (R² = 0.87), which may be linked to greater yield fluctuations in this irrigation range. Notably, the model achieves its highest R² (0.92) under the “All irrigation level” category, exceeding its performance on every single-level sub-test set. This likely reflects the more comprehensive sample distribution and richer feature information in the original complete test set, which spans both the common patterns and the level-specific characteristics of yield change across irrigation conditions.
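The per-level comparison amounts to grouping test-set predictions by irrigation level and computing the same metrics on each group and on the pooled set. The sketch below illustrates this under stated assumptions: the record layout, the function names, and the toy values are hypothetical and are not data or code from the study.

```python
import math
from collections import defaultdict

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def metrics_by_level(records):
    """records: iterable of (irrigation_level, observed_g, predicted_g)."""
    groups = defaultdict(lambda: ([], []))
    for level, obs, pred in records:
        # Each record counts toward its own level and toward the pooled set.
        for key in (level, "All irrigation level"):
            groups[key][0].append(obs)
            groups[key][1].append(pred)
    return {
        level: {"RMSE": rmse(t, p), "MAPE": mape(t, p), "R2": r2(t, p)}
        for level, (t, p) in groups.items()
    }

# Toy values for illustration only; these are NOT measurements from the study.
toy = [
    ("0%", 60.0, 64.0), ("0%", 70.0, 66.0), ("0%", 80.0, 83.0),
    ("50%", 110.0, 106.0), ("50%", 120.0, 124.0), ("50%", 130.0, 127.0),
]
for level, m in metrics_by_level(toy).items():
    print(level, {k: round(v, 3) for k, v in m.items()})
```

In this toy data the pooled R² is higher than either per-level value even though the errors have the same size, because the pooled set spans a wider yield range. Whether that effect contributes to the pattern in Table 7 is not established by the text.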
4. Discussion
4.1. Performance Comparison of Different Vegetation Indices in the CNN-LSTM Model
The comparison of vegetation indices shows that, under greenhouse conditions on the Yunnan Plateau, RECI is the optimal choice for small-scale rice yield prediction. This is because red-edge vegetation indices draw on the red-edge and near-infrared bands, enabling them to capture dynamic changes in chlorophyll [52]. RECI performed best on the test set, with a high coefficient of determination (R² = 0.86) and a low Root Mean Squared Error (RMSE = 13.14 g), supporting its application in precise yield estimation.
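As a brief illustration of how a red-edge index is derived from the two bands, the sketch below assumes the commonly used definition RECI = NIR / RedEdge − 1. If the definition given earlier in the paper differs, that one takes precedence. The function names and reflectance values are hypothetical.

```python
def reci(nir: float, red_edge: float, eps: float = 1e-6) -> float:
    """Red-edge chlorophyll index for one pixel: NIR / RedEdge - 1."""
    # eps guards against division by zero in dark or masked pixels.
    return nir / max(red_edge, eps) - 1.0

def mean_plot_reci(nir_band, red_edge_band):
    """Average RECI over a plot given two equally shaped 2-D reflectance grids."""
    values = [
        reci(n, r)
        for nir_row, re_row in zip(nir_band, red_edge_band)
        for n, r in zip(nir_row, re_row)
    ]
    return sum(values) / len(values)

# Hypothetical reflectance values for a 2 x 2 pixel plot (not data from the study).
nir_band = [[0.50, 0.48], [0.52, 0.50]]
red_edge_band = [[0.25, 0.24], [0.26, 0.25]]
plot_reci = mean_plot_reci(nir_band, red_edge_band)
print(round(plot_reci, 3))
```

Because the index is a band ratio, it rises as near-infrared reflectance increases relative to red-edge reflectance, which is the behaviour associated with higher chlorophyll content.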
Furthermore, in small-scale scenarios, single-vegetation-index modeling was adopted to avoid redundancy among vegetation indices, overfitting with limited samples, and high computational requirements, making the model usable as a basic tool [53,54,55]. Training with a low learning rate and more epochs suited RECI and reduced oscillation in its validation loss, whereas the other indices were more prone to overfitting under the same conditions.
Multispectral vegetation indices can substantially improve the accuracy of yield prediction. However, existing studies have mainly focused on common indices such as the NDVI, NDRE, and Normalized Difference Yellowness Index (NDYI), and few have used RECI [27,56,57,58]. Nevertheless, the small-plot results of this study are consistent with those of similar studies on rice yield prediction from UAV-borne multispectral data: combining UAV multispectral images with artificial intelligence algorithms can improve rice yield prediction accuracy. Despite these advantages, RECI still requires further verification and optimization before practical application.
Meanwhile, in large-scale estimation, multi-vegetation-index fusion is commonly used to achieve broader coverage. However, small-scale applications face challenges in data processing and visualization, as they require higher resolution, greater variability, and more refined geographic information to meet reconstruction needs [59,60].
4.2. Performance Evaluation of WT-CNN-BiLSTM and Baseline Models
The WT-CNN-BiLSTM model outperformed the baseline models on the RECI dataset, benefiting from bidirectional processing, residual connections, and wavelet modules, which collectively enhanced its ability to capture features. Similar improvements have also been observed in crop yield models using UAV-borne multispectral data, where yield prediction performance was further improved by optimizing models [31,56,61]. Ablation experiments in this study confirmed the importance of the synergistic effect of these modules in enhancing yield prediction performance.
For yield prediction in small greenhouse areas, the current focus is mainly on improving prediction precision. In contrast, yield visualization is typically conducted at higher flight altitudes for large-scale yield prediction tasks [62], with few applications in greenhouse environments. This poses new challenges for yield prediction in greenhouse rice breeding. Because greenhouse environments have limited space, high crop density, and short growth cycles, traditional yield visualization methods based on high-altitude remote sensing are hardly applicable.
Future research will focus on developing close-range multispectral imaging systems and high-precision modeling algorithms suited to small-scale greenhouse scenarios, to achieve accurate yield estimation and distribution visualization for individual rice plants or small plots. This will not only facilitate the rapid screening of high-yield traits during breeding but also provide data support for intelligent greenhouse management, moving rice breeding toward digital, high-throughput workflows.
4.3. Impact of Irrigation Levels on WT-CNN-BiLSTM Performance and Greenhouse Rice Yield
Irrigation levels affected both rice yield and model performance. The nonlinear relationship between irrigation and yield reflected the physiological characteristics of rice: yield reached its optimum at moderate irrigation levels and decreased under extreme levels due to stress. The highest R² (0.92), obtained on the complete test set, highlighted the role of diverse samples in enhancing model stability and provides guidance for precision irrigation to improve water use efficiency.
This finding aligns with rice research on irrigation-yield dynamics, where maximum yield is achieved at moderate irrigation levels [63,64]. The model proposed in this study indirectly captured drip irrigation information through multispectral data, and the accuracy of yield prediction was improved through diverse samples. Future work will aim to further improve prediction accuracy by integrating multimodal data such as meteorological, soil, and management factors.
4.4. Limitations and Future Work
This study used data from a single controlled greenhouse environment (one location, one growing season, and one rice cultivar) to isolate the impact of irrigation on yield without interference from external variables such as weather fluctuations. However, this limits the model’s general applicability under diverse field conditions, as greenhouses provide a stable environment that differs from variable open-field conditions. A further limitation is the lack of spatial visualization of predicted yields, which the greenhouse environment made impractical. Future studies will extend to outdoor fields in Yunnan’s complex terrain, leveraging automated reconstruction tools (e.g., DJI Terra) to create orthomosaics and visualize predicted yields spatially. Nevertheless, the constructed model is expected to be well suited to breeding tasks in greenhouse environments on the Yunnan Plateau, as results remain controllable under consistent greenhouse conditions and soil types.
While the model showed promising performance in greenhouse rice yield estimation, its credibility in broad applications is reduced by the limitations of single-location and controlled data, which lack real-world variability. Future work should expand to multiple environments, growing seasons, and cultivars, and incorporate multimodal data (e.g., irrigation amount, meteorological indicators, and soil properties) to improve accuracy.
For adaptation to other crops such as wheat, only minor modifications to data collection and model construction are expected to be needed. For example, data collection should include additional steps to control UAV altitude, avoiding data distortion or crop damage, and model parameters, primarily the wavelet basis (e.g., changing from db1 to db2), should be tuned to better extract features while other settings are kept unchanged. For different crops, scalability can be further enhanced by integrating more indices and automating workflows.
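To make the db1-to-db2 adjustment concrete, the sketch below applies a one-level 1-D discrete wavelet transform with each basis to a toy linear ramp. It is a stand-alone illustration, not the WTConv layer used in the model: the periodic boundary handling, the filter ordering, and the function names are assumptions, so the coefficients may not match a library implementation value for value.

```python
import math

SQRT2, SQRT3 = math.sqrt(2.0), math.sqrt(3.0)

# Orthonormal low-pass (scaling) filters; db1 is the Haar wavelet.
LOWPASS = {
    "db1": [1 / SQRT2, 1 / SQRT2],
    "db2": [(1 + SQRT3) / (4 * SQRT2), (3 + SQRT3) / (4 * SQRT2),
            (3 - SQRT3) / (4 * SQRT2), (1 - SQRT3) / (4 * SQRT2)],
}

def dwt_periodic(signal, basis):
    """One-level 1-D DWT with periodic extension; returns (approx, detail)."""
    h = LOWPASS[basis]
    # Quadrature mirror high-pass filter: g[n] = (-1)^n * h[L - 1 - n].
    g = [(-1) ** n * h[len(h) - 1 - n] for n in range(len(h))]
    n_samples = len(signal)
    if n_samples % 2 or n_samples < len(h):
        raise ValueError("signal length must be even and at least the filter length")
    approx, detail = [], []
    for k in range(n_samples // 2):
        window = [signal[(2 * k + n) % n_samples] for n in range(len(h))]
        approx.append(sum(c * x for c, x in zip(h, window)))
        detail.append(sum(c * x for c, x in zip(g, window)))
    return approx, detail

# Toy linear ramp standing in for a smooth image row (not data from the study).
ramp = [float(i) for i in range(8)]
for basis in ("db1", "db2"):
    _, detail = dwt_periodic(ramp, basis)
    print(basis, [round(d, 3) for d in detail])
```

With db1 every detail coefficient of the ramp is nonzero, whereas with db2 the detail coefficients vanish away from the wrap-around boundary because db2 has two vanishing moments. This is one reason a longer basis may represent smooth canopy gradients more compactly, although the best choice for a given crop would need to be checked empirically.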
5. Conclusions
This study proposes a WT-CNN-BiLSTM hybrid model to address the limitations of traditional yield prediction methods in small-scale plateau rice cultivation and breeding scenarios. First, UAV multispectral images of rice throughout the entire growth period under 5 drip irrigation levels were collected, and yield data from 500 small plots (0.5 m × 0.5 m) were measured to construct a dedicated dataset. Second, with CNN-LSTM as the baseline, 4 common vegetation indices were compared, and RECI was identified as the optimal one (test set: R² = 0.86, RMSE = 13.14 g), avoiding the overfitting and validation loss fluctuations that occurred with NDVI, NDRE, and OSAVI. Subsequently, the WT-CNN-BiLSTM model integrating WTConv, ResNet50, and BiLSTM was developed. Replacing the convolutional layer of the ResNet50 residual block with WTConv enhanced multi-frequency feature extraction, while BiLSTM captured the long-term growth trends of rice. The test set RMSE of this model decreased by 26.33% compared with the baseline CNN-LSTM, reaching 9.68 g, with MAPE dropping to 11.41% and R² increasing to 0.92. Cross-validation (RMSE = 8.07 g, R² = 0.94) supported its generalization ability. Finally, an analysis of the model’s performance under different irrigation levels revealed that rice yield peaked at a moderate drip irrigation amount of 50% (7329.16 g). The model achieved an R² of 0.91 under both the 50% and 25% irrigation levels and an R² of 0.92 on the complete test set covering all irrigation levels, exhibiting good adaptability. Although the model is limited by data from a single greenhouse and a single rice variety and lacks spatial visualization of yield, it helps fill the gap in yield prediction for small-scale plateau rice and provides technical support for accurate yield assessment in rice production and breeding.