Estimation and Forecasting of Rice Yield Using Phenology-Based Algorithm and Linear Regression Model on Sentinel-II Satellite Data

: Rice is a primary food for more than three billion people worldwide and cultivated on about 12% of the world’s arable land. However, more than 88% production is observed in Asian countries, including Pakistan. Due to higher population growth and recent climate change scenarios, it is crucial to get timely and accurate rice yield estimates and production forecast of the growing season for governments, planners, and decision makers in formulating policies regarding import/export in the event of shortfall and/or surplus. This study aims to quantify the rice yield at various phenological stages from hyper-temporal satellite-derived-vegetation indices computed from time series Sentinel-II images. Different vegetation indices (viz. NDVI, EVI, SAVI, and REP) were used to predict paddy yield. The predicted yield was validated through RMSE and ME statistical techniques. The integration of PLSR and sequential time-stamped vegetation indices accurately predicted rice yield (i.e., maximum R 2 = 0.84 and minimum RMSE = 0.12 ton ha − 1 equal to 3% of the mean rice yield). Moreover, our results also established that optimal time spans for predicting rice yield are late vegetative and reproductive (ﬂowering) stages. The output would be useful for the farmer and decision makers in addressing food security.


Introduction
The rapid increase in the world population exerts pressure on the agriculture sector and threatening the food security of the world [1]. Among cereals, rice is one of the prime sources of food with high nutritive value (i.e., containing carbohydrate, vitamins (B, E, thiamine), and minerals (Ca, Mg, Fe). Rice is widely grown, consumed globally (i.e., daily food of 3.5 billion people worldwide), and accounts for 19% of the dietary energy [2]. Globally, 90% of the rice comes from Asia, which is approximately 640 million tons per annum [3,4]. Pakistan ranks 11th at the global rice production list and contributes 8% to the world's total rice trade [3]. Pakistan produced seven (7) million tons of rice in the year most critical growth period (phenological stage) for quantification of rice yield with hyper-temporal sentinel-II and derived-indices using Partial Least Square Regression (PLSR) model to improve the estimation accuracy of rice yield by remote sensing.

Study Area
This study was carried out at Sheikhupora district (Latitude 31°30′00" to 31°65′30" N and Longitude 73°40′00" to 74°23′00" E) of Punjab, Pakistan ( Figure 1). Sheikhupora district is located between Ravi and Chenab rivers and irrigated by river water from two canals (Upper Chenab and Khanpur Canals). Climatically the region is dominated by the wet monsoon, thus making it favorable for rice crop. The annual precipitation ranges from 120 to 720 mm, which mainly occurs between July and August [28]. The study area is dominated by alluvial clay and loamy soil rich in humus and mineral composition. The mineralogical, chemical and geotechnical compositions of soil (pH = 8, EC = 1.1−4.5 dS m −1 with soil field capacity from 45-71%) make the region ideal for rice cultivation [29]. Due to the favorable conditions, Sheikhupora district is the second largest rice-producing district in Pakistan with an average production of 2-2.5 million tons annually [5,30].

Field Data
Stratified-random-sampling procedure was used to collect data from 137 plots well distributed in the study area. The rice fields with minimum size of 60 × 60 m (i.e., corresponding to the coarse pixel size of satellite images used in this study) were considered for the purpose of sampling, monitoring, and analysis. In most of the paddy fields, the rice crop was transplanted in start of July (after seed sown in the nursery at the start of June). These sampled plots were carefully monitored from transplanting till harvesting period. The rice produced in each plot ( Figure 2) was carefully measured and recorded for further analysis.

Data Collection 2.2.1. Field Data
Stratified-random-sampling procedure was used to collect data from 137 plots well distributed in the study area. The rice fields with minimum size of 60 × 60 m (i.e., corresponding to the coarse pixel size of satellite images used in this study) were considered for the purpose of sampling, monitoring, and analysis. In most of the paddy fields, the rice crop was transplanted in start of July (after seed sown in the nursery at the start of June). These sampled plots were carefully monitored from transplanting till harvesting period. The rice produced in each plot ( Figure 2) was carefully measured and recorded for further analysis.

Satellite Data Preprocessing and Vegetation Index
The optical remote sensing data of Sentinel-II satellite for year 2016 was used in this study. The availability of multiple spectral bands in the red, red edge, and near infrared (NIR) part of the EM spectrum with 10-20 m resolution makes Sentinel-II an ideal choice for studying vegetation phenology and monitoring agricultural crops for stress level, nutrient contents, pest attack, and yield estimation [31][32][33]. The cloud free Sentinel-2 timeseries images (i.e., spanning from growing to harvesting phase) were pre-processed for atmospheric correction using Sen2Cor processing. The atmospherically corrected images were then used for computing vegetation indices (Table 1) from times series satellite images spanning from sowing till harvesting period of the rice crop. Vegetation indices are mathematical transformations using two or more spectral bands devised to enhance certain characteristics of vegetation [34]. Several images to cover the entire growth period of rice crop were used and vegetation indices were computed using Google Earth Engine (GEE). ) − ( ) ) + ( ) + inimize the brightness effect of soil Huete (1988) [36] ( ) − ( ) 1 × ( ) − 2 × ( ) + 0.5 (Soil adjusted factor); C1 and C2 are constants to reduce aerosols effects. Liu and Huete (1995) [37] represent interpolation constants that can be adjusted according to available band's

Satellite Data Preprocessing and Vegetation Index
The optical remote sensing data of Sentinel-II satellite for year 2016 was used in this study. The availability of multiple spectral bands in the red, red edge, and near infrared (NIR) part of the EM spectrum with 10-20 m resolution makes Sentinel-II an ideal choice for studying vegetation phenology and monitoring agricultural crops for stress level, nutrient contents, pest attack, and yield estimation [31][32][33]. The cloud free Sentinel-2 time-series images (i.e., spanning from growing to harvesting phase) were pre-processed for atmospheric correction using Sen2Cor processing. The atmospherically corrected images were then used for computing vegetation indices (Table 1) from times series satellite images spanning from sowing till harvesting period of the rice crop. Vegetation indices are mathematical transformations using two or more spectral bands devised to enhance certain characteristics of vegetation [34]. Several images to cover the entire growth period of rice crop were used and vegetation indices were computed using Google Earth Engine (GEE).

Geo-Statistical Analysis
The PLSR is an established multivariate analysis technique commonly used in chemomatric and hyperspectral data analysis [39,40]. The Partial Least Square Regression (PLSR) analyses were performed to all time-series vegetation indices (explanatory variables) with rice yields (response variable). In PLSR model development, the selection of optimum number of latent variables (LVs) is more critical, as increase of the number of LVs would improve the accuracy of the model, while selection of too many variables can lead to the over fitting and the error would increase [41]. To minimize this over fitting problem, the optimal number of LVs was selected based on achieving a combination of a high R 2 and a low root mean squared error of the prediction (RMSEP) (Figure 3). The PLSR model was evaluated by plotting the 1:1 relationship graph between the predicted and measured values of the yield (Figure 3). The evaluation indices were the R 2 and the RMSE. A larger R 2 shows that the model is better, while smaller RMSE values indicate the stronger estimation ability of the model. To evaluate the performances of the prediction models, leave-one-out cross-validation [42] was used, in which the model was iteratively trained on multiple time series data and then used to predict yield. PLSR can be mathematically expressed as: where Y is response variable (rice yield), X 1 , X 2 -X n are the selected latent variables (LVs), which are the time series images, in this case a is the intercept and b 1 -b n represent the regression coefficients (also known as β-coefficients) for different predictors.
Agriculture 2021, 11, x FOR PEER REVIEW 5 of 1 (PLSR) analyses were performed to all time-series vegetation indices (explanatory varia bles) with rice yields (response variable). In PLSR model development, the selection o optimum number of latent variables (LVs) is more critical, as increase of the number o LVs would improve the accuracy of the model, while selection of too many variables can lead to the over fitting and the error would increase [41]. To minimize this over fitting problem, the optimal number of LVs was selected based on achieving a combination of a high R 2 and a low root mean squared error of the prediction (RMSEP) (Figure 3). The PLSR model was evaluated by plotting the 1:1 relationship graph between the predicted and measured values of the yield ( Figure 3). The evaluation indices were the R 2 and the RMSE A larger R 2 shows that the model is better, while smaller RMSE values indicate the stronger estimation ability of the model. To evaluate the performances of the prediction models, leave-one-out cross-validation [42] was used, in which the model was iteratively trained on multiple time series data and then used to predict yield. PLSR can be mathe matically expressed as: where Y is response variable (rice yield), X1, X2-Xn are the selected latent variables (LVs) which are the time series images, in this case a is the intercept and b1-bn represent the regression coefficients (also known as β-coefficients) for different predictors.

Spatial Distribution and Mapping of Rice Yield
To model the spatial distribution of rice yields, a two-step procedure was adopted The thematic map of rice crop was developed using phenological based mapping algo rithms ( Figure 4). In this routine, the phenology profiles (or signatures) serves as numer ical key for discerning different crop types grown in the region of interest [43,44]. The time series vegetation indices (e.g., NDVI profiles computed from optical data for the entire growth span of rice crop: 130 days) were used to demarcate the rice crop and to compute the rice cultivated area. The phenological-mapping-routine takes into account the entire range (minimum-maximum) of vegetation index values (e.g., NDVI in this case) at each time stamp (e.g., from transplanting till harvesting) and can be mathematically expressed as: R = (NDVI min and NDVI max) and (NDVI min and NDVI max)NDVI min and NDVI max) (2 After a certain number of latent variables, the decrease in RMSE was negligible and that was taken as the optimum number of variables for PLSR model development.

Spatial Distribution and Mapping of Rice Yield
To model the spatial distribution of rice yields, a two-step procedure was adopted. The thematic map of rice crop was developed using phenological based mapping algorithms ( Figure 4). In this routine, the phenology profiles (or signatures) serves as numerical key for discerning different crop types grown in the region of interest [43,44]. The time series vegetation indices (e.g., NDVI profiles computed from optical data for the entire growth span of rice crop: 130 days) were used to demarcate the rice crop and to compute the rice cultivated area. The phenological-mapping-routine takes into account the entire range (minimum-maximum) of vegetation index values (e.g., NDVI Agriculture 2021, 11, 1026 6 of 14 in this case) at each time stamp (e.g., from transplanting till harvesting) and can be mathematically expressed as: R = (NDVI 1 min and NDVI 1 max) and (NDVI 2 min and NDVI 2 max) NDVI n min and NDVI n max) (2) where R represents response variable (rice), and NDVI 1 , NDVI 2 , and NDVI n represent ND-VIs derived from RS data at different crop phenological stages starting from transplanting till harvesting.  To develop the spatial distribution maps of rice yield, the models of statistical analysis (the regression equations of PLSR analysis) were inverted to the time series vegetation indices of rice masked areas (developed from phenological based mapping) and were validated with an independent dataset.

Rice Yield Estimation (Field Level Data)
The statistical descriptions of the measured rice yield are summarized in Table 2. In calibration datasets (with sample size n = 96), the rice yield varies between 3.06 ton/ha and 4.15 ton/ha with mean equal to 3.70 ton/ha. The standard deviation (SD) was ±0.31 ton/ha and coefficient of covariance (CV) equal to 0.083 ton/ha. The graphical display reflects that the calibration dataset is near normally distributed. The summary statistics of validation sets (n = 41) shows that rice yield ranges between 3.16 ton/ha and 4.15 ton/ha with average equal to 3.71 ton/ha. The standard deviation (SD) To develop the spatial distribution maps of rice yield, the models of statistical analysis (the regression equations of PLSR analysis) were inverted to the time series vegetation indices of rice masked areas (developed from phenological based mapping) and were validated with an independent dataset.

Rice Yield Estimation (Field Level Data)
The statistical descriptions of the measured rice yield are summarized in Table 2. In calibration datasets (with sample size n = 96), the rice yield varies between 3.06 ton/ha and 4.15 ton/ha with mean equal to 3.70 ton/ha. The standard deviation (SD) was ±0.31 ton/ha and coefficient of covariance (CV) equal to 0.083 ton/ha. The graphical display reflects that the calibration dataset is near normally distributed.

Variation in Temporal Profiles of Vegetation Indices with Rice Phenology
The temporal profiles of vegetation index spanning across the full growth period of rice crop are shown in Figure 5. The vegetation index (e.g., NDVI) values were least (minimum) at transplanting phase and showed gradual increase with increase in vegetative parts ( Figure 5). The vegetation indices reached at peak in the late vegetative phase and continually maintained high values (e.g., spectral plateau) till flowering phase. At post flowering phase (e.g., ripening phase), the vegetation index values started declining and reached its minimum at fully ripened harvesting phase (see Figure 5).

Variation in Temporal Profiles of Vegetation Indices with Rice Phenology
The temporal profiles of vegetation index spanning across the full growth period of rice crop are shown in Figure 5. The vegetation index (e.g., NDVI) values were least (minimum) at transplanting phase and showed gradual increase with increase in vegetative parts ( Figure 5). The vegetation indices reached at peak in the late vegetative phase and continually maintained high values (e.g., spectral plateau) till flowering phase. At post flowering phase (e.g., ripening phase), the vegetation index values started declining and reached its minimum at fully ripened harvesting phase (see Figure 5). The summary statistics of validation sets (n = 41) shows that rice yield ranges between 3.16 ton/ha and 4.15 ton/ha with average equal to 3.71 ton/ha. The standard deviation (SD) was ±0.29 and coefficient of covariance (CV) equal to 0.078 ton/ha. The validation data are also normally distributed (Table 2: see the graphical display at last column).

Variation in Temporal Profiles of Vegetation Indices with Rice Phenology
The temporal profiles of vegetation index spanning across the full growth period of rice crop are shown in Figure 5. The vegetation index (e.g., NDVI) values were least (minimum) at transplanting phase and showed gradual increase with increase in vegetative parts ( Figure 5). The vegetation indices reached at peak in the late vegetative phase and continually maintained high values (e.g., spectral plateau) till flowering phase. At post flowering phase (e.g., ripening phase), the vegetation index values started declining and reached its minimum at fully ripened harvesting phase (see Figure 5).

Prediction of Rice Yield and the Performance of Vegetation Indices
The prediction of rice yields using PLSR and time series vegetation indices are summarized Table 3. The integration of PLSR and sequential-time-stamped-vegetation indices were found effective for quantifying rice yields. The time series NDVI yielded the highest R 2 = 0.83 (lowest RMSEcv = 0.12 ton/ha) followed by EVI (R 2 = 0.80, RMSEcv = 0.14 ton/ha), SAVI (R 2 = 0.79, RMSEcv = 0.14 ton/ha), and REP (R 2 = 0.64, RMSEcv = 0.17 ton/ha). The performance of different indices (NDVI, SAVI, EVI, REP) for rice yield estimation were consistent for both calibration and validation datasets (Table 3, Figure 6c,f,i,l).

Prediction of Rice Yield and the Performance of Vegetation Indices
The prediction of rice yields using PLSR and time series vegetation indices are summarized Table 3. The integration of PLSR and sequential-time-stamped-vegetation indices were found effective for quantifying rice yields. The time series NDVI yielded the highest R 2 = 0.83 (lowest RMSE cv = 0.12 ton/ha) followed by EVI (R 2 = 0.80, RMSE cv = 0.14 ton/ha), SAVI (R 2 = 0.79, RMSE cv = 0.14 ton/ha), and REP (R 2 = 0.64, RMSE cv = 0.17 ton/ha). The performance of different indices (NDVI, SAVI, EVI, REP) for rice yield estimation were consistent for both calibration and validation datasets (Table 3, Figure 6c,f,i,l).  a,d,g,j). After certain number of latent variables, the decrease in RMSE was negligible and that was taken as the optimum number of variables for PLSR model development. The temporal profiles of vegetation indices are similar in shape except REP (Panels b,e,h,k). The regression coefficients lines show that sowing, late vegetative, flowering and ripening are important phases for predicting rice yields using PLSR. The measured vs. predicted (Panels c,f,i,l) manifest that yield was best predicted using temporal NDVI data (yielded highest R 2 and lowest RMSE).
Using the time series profiles of NDVI, EVI and SAVI, the PLSR models selected six latent variables ( Figure 6, Table 3). The selected six latent variables explained most of the variance (e.g., as the case of NDVI where R 2 = 0.83) and the addition of further variables hardly improve the model performance (e.g., the total 21 variables yield maximum R 2 of 0.84) (Figure 6a,d,g,j). The important latent variable (in this study, time stamped vegetation indices) was located at late vegetative, reproductive (flowering), and ripening phases of rice growth (Figure 6b,e,h,k). Using time series Red Edge Position (REP) data, the number of selected latent variables were five (05) with high regression coefficient (or B-coefficient) values at flowering, ripening, and late vegetative phases (Figure 6c).    a,d,g,j). After certain number of latent variables, the decrease in RMSE was negligible and that was taken as the optimum number of variables for PLSR model development. The temporal profiles of vegetation indices are similar in shape except REP (Panels b,e,h,k). The regression coefficients lines show that sowing, late vegetative, flowering and ripening are important phases for predicting rice yields using PLSR. The measured vs. predicted (Panels c,f,i,l) manifest that yield was best predicted using temporal NDVI data (yielded highest R 2 and lowest RMSE).
Using the time series profiles of NDVI, EVI and SAVI, the PLSR models selected six latent variables ( Figure 6, Table 3). The selected six latent variables explained most of the variance (e.g., as the case of NDVI where R 2 = 0.83) and the addition of further variables hardly improve the model performance (e.g., the total 21 variables yield maximum R 2 of 0.84) (Figure 6a,d,g,j). The important latent variable (in this study, time stamped vegetation indices) was located at late vegetative, reproductive (flowering), and ripening phases of rice growth (Figure 6b,e,h,k). Using time series Red Edge Position (REP) data, the number of selected latent variables were five (05) with high regression coefficient (or B-coefficient) values at flowering, ripening, and late vegetative phases (Figure 6c).

Spatial Varability in Rice Yield Potential
The distribution map (developed from best predicting PLSR model, time series vegetation indices and map of rice grown area) reflects that rice yield distribution varies in space and ranges between 1.5 to 4.2 ton/ha (Figure 7a). The upper limit of rice yields (i.e., 4.20 ton/ha) in the distribution map were closely matching with the ground measured yield (4.15 ton/ha). The minimum limit was underestimated in the spatial distribution map of rice yield (1.5 ton/ha) compared to the ground measured minimum yield (3.06 ton/ha). The validation of spatial distribution maps against independent ground measured yield data (30% of the total samples) confirms that yield was predicted with high accuracy (see Figure 7b). The high R 2 (0.83) and low RMSE (0.14 tons/ha) manifested a close match between measured and predicted rice yields (Figure 7b). yield (4.15 ton/ha). The minimum limit was underestimated in the spatial distribution map of rice yield (1.5 ton/ha) compared to the ground measured minimum yield (3.06 ton/ha). The validation of spatial distribution maps against independent ground measured yield data (30% of the total samples) confirms that yield was predicted with high accuracy (see Figure 7b). The high R 2 (0.83) and low RMSE (0.14 tons/ha) manifested a close match between measured and predicted rice yields (Figure 7b).

Discussion
The field data (Table 2) reflect that the measured rice yield production (minimum (3.06 ton/ha), mean (3.7 ton/ha), maximum 4.15 (ton/ha) in the study area is within the limits of rice crop statistics within the country (i.e., ranging from 2.4 ton/ha to 10 ton/ ha). These numbers are far less than the statistics of the neighboring countries (i.e., China, Vietnam, Bangladesh) and could be attributed to the variety of rice [45], uneven water usage, weeds and pest attacks, and post-harvest loses (e.g., shattering and improper drying and storing etc.). The super basmati grown in the study area produces high quality rice and is famous for its aromatic fragrance; however, it is less productive compared to other hybrid varieties.
The time series profiles show that vegetation indices (Figure 8a) are minimum at the transplantation stage and gradually increase in vegetative phase (tillering, panicle, and flowering stages). A decline was observed in the vegetation indices after post flowering phases (e.g., dough, ripening). This initial increase in vegetation index values could be associated with increase in leaf area coverage (LAI, biomass) and the post flowering phase decline could be attributed to the senescing of rice crop. The temporal variation in vegetation indices (in this study) are in line with the findings of previous studies [46,47], where the peak greenness is achieved at flowering/milk phases and decline was observed in dough stages and reaching minimum at ripening phase ( Figure 8).

Discussion
The field data ( Table 2) reflect that the measured rice yield production (minimum (3.06 ton/ha), mean (3.7 ton/ha), maximum 4.15 (ton/ha) in the study area is within the limits of rice crop statistics within the country (i.e., ranging from 2.4 ton/ha to 10 ton/ ha). These numbers are far less than the statistics of the neighboring countries (i.e., China, Vietnam, Bangladesh) and could be attributed to the variety of rice [45], uneven water usage, weeds and pest attacks, and post-harvest loses (e.g., shattering and improper drying and storing etc.). The super basmati grown in the study area produces high quality rice and is famous for its aromatic fragrance; however, it is less productive compared to other hybrid varieties.
The time series profiles show that vegetation indices (Figure 8a) are minimum at the transplantation stage and gradually increase in vegetative phase (tillering, panicle, and flowering stages). A decline was observed in the vegetation indices after post flowering phases (e.g., dough, ripening). This initial increase in vegetation index values could be associated with increase in leaf area coverage (LAI, biomass) and the post flowering phase decline could be attributed to the senescing of rice crop. The temporal variation in vegetation indices (in this study) are in line with the findings of previous studies [46,47], where the peak greenness is achieved at flowering/milk phases and decline was observed in dough stages and reaching minimum at ripening phase ( Figure 8). The integration of PLSR and time series vegetation indices accurately predicted rice yields (the maximum R 2 = 0.83 and least RMSE = 0.12 ton/ha (3.12% of the mean yield). The slightly better performance of NDVI (relative to SAVI, EVI, and REP) could be attributed to the canopy characteristics (e.g., structural, vegetation percent cover) of rice crop. The abundance of stem and leaf blades of rice crop obscures the background soil visibility and leads to enhanced vegetation (rice) reflectance signals, thus allowing slope-based-index (such as NDVI) to perform more precise estimates of rice yield [48,49].
Using PLSR regression, the prediction accuracy enhances with augmenting the number of variables (NDVI, EVI, SAVI, and REP) and leads to high R 2 and low RMSE until the model stabilized at a certain point. In this study, the PLSR model saturate at the addition of six latent variables (Figure 6a,d,g,j) and captured maximum variance present in dataset. The addition of further variables hardly improves the prediction of rice yield and displays a flat line [50]. The selected latent variables belong to late vegetative, reproductive (panicle, flowering, milky), and ripening phases (Figure 6b,e,h,k) and reflects the critical importance of these growth stages in rice yield production. The picking of late vegetative phase may be associated with an increase in biomass and leaf covered area (leaves and shoots fully develop at this stage and gain maximum crop height). The high correlation at reproductive phases could be associated with the formation panicle, flowering, milk, dough, and maturity of grain, which directly influence the crop yields. The results of this study are also consistent with the findings of existing literature, where booting stage highly influences the rice yield production [51]. The outcomes of this study help in estimating the rice yield and highlight the critical phases in the life cycle (of rice crop) where monitoring and human intervention (such as usage of water and agrochemicals) can enhance the yield production.

Conclusions
This study aims to accurately quantify rice yield and identify the critical growth stages that influence the rice yield production. The integration of PLSR and time series vegetation indices (i.e., spanning across the entire rice crop growth period) results in accurate predictions of rice yield. Among vegetation indices, NDVI performs the best (yield high R 2 and low RMSE) followed by EVI, SAVI, and REP. Using the time stamped vegetation indices, the PLSR coefficients identified the growth stages that influence the rice yield. The growth stages (selected latent variables) belong to late vegetative, reproductive (panicle, flowering, milky), and ripening phases. The selected critical growth stages were common in all four types of vegetation indices (i.e., NDVI, EVI, SAVI, and REP) used. This study concludes that PLSR can effectively be used for rice yield estimation and identifying critical stages of the rice growth cycle. The precise yield estimation (rice in this case) allows decision makers to strategize policy regarding yield import and export. The outcome of this study can also help the farmers to monitor rice at critical time spans and allow them to intervene (e.g., usage of water, fertilizer, pesticides etc.) in a timely manner. The timely interventions thus help in producing more yields which in turn is essential for minimizing hunger (SDG 2), alleviating poverty (SDG 1), ensuring land (SDG 15) and food security.
However, the present study did not compare the accuracy of PLS algorithm with artificial neural networks, support vector machines, other geo-statistics, etc., for yield forecasting. These would be interesting directions for future study.